

Title:
SEARCHING THE RESULTS OF AN AUTOMATIC SPEECH RECOGNITION PROCESS
Document Type and Number:
WIPO Patent Application WO/2017/020011
Kind Code:
A1
Abstract:
Various disclosed implementations involve searching the results of an automatic speech recognition (ASR) process, such as an ASR process that has been performed on a recording of a monologue or of a conference, such as a teleconference or a video conference. An initial search query, including at least one search word, may be received. The initial search query may be analyzed according to phonetic similarity and semantic similarity. An expanded search query may be determined according to the phonetic similarity, the semantic similarity, or both the phonetic similarity and the semantic similarity. A search of the speech recognition results data may be performed according to the expanded search query. Some aspects of this disclosure involve playing back audio data that corresponds with such search results.

Inventors:
HUANG SHEN (CN)
CARTWRIGHT RICHARD J (AU)
Application Number:
PCT/US2016/044878
Publication Date:
February 02, 2017
Filing Date:
July 29, 2016
Assignee:
DOLBY LABORATORIES LICENSING CORP (US)
International Classes:
G10L15/22; G10L15/26; G10L25/54; H04M3/42
Other References:
CHELBA C ET AL: "Retrieval and browsing of spoken content", IEEE SIGNAL PROCESSING MAGAZINE, IEEE SERVICE CENTER, PISCATAWAY, NJ, US, vol. 25, no. 3, 1 May 2008 (2008-05-01), pages 39 - 49, XP011226390, ISSN: 1053-5888, DOI: 10.1109/MSP.2008.917992
CHEN ZHIPENG ET AL: "Improving keyword search by query expansion in a probabilistic framework", THE 9TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, IEEE, 12 September 2014 (2014-09-12), pages 187 - 191, XP032669176, DOI: 10.1109/ISCSLP.2014.6936639
BARRY MICHAEL ARONS: "Interactively skimming recorded speech", PHD DISSERTATION, MASSACHUSETTS INSTITUTE OF TECHNOLOGY, 1 February 1994 (1994-02-01), XP055140419, Retrieved from the Internet [retrieved on 20140916]
MARTHA LARSON ET AL: "Spoken Content Retrieval - A Survey of Techniques and Technologies", FOUNDATIONS AND TRENDS IN INFORMATION RETRIEVAL, vol. 5, no. 3, 1 January 2012 (2012-01-01), US, pages 235 - 422, XP055311442, ISSN: 1554-0669, DOI: 10.1561/1500000020
Attorney, Agent or Firm:
DOLBY LABORATORIES, INC. et al. (US)
Claims:
CLAIMS

We claim:

1. A method for processing audio data, the method comprising:

receiving speech recognition results data for at least a portion of an audio recording;

receiving an initial search query including at least one search word;

analyzing the initial search query according to phonetic similarity and semantic similarity;

determining an expanded search query according to the phonetic similarity, the semantic similarity, or both the phonetic similarity and the semantic similarity; and

performing a search of the speech recognition results data according to the expanded search query.

2. The method of claim 1, wherein the audio recording includes at least a portion of a recording of a conference involving a plurality of conference participants.

3. The method of claim 2, wherein the speech recognition results data includes a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices, the word recognition confidence score corresponding with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference.

4. The method of any one of claims 1-3, wherein analyzing the initial search query involves analyzing syllables and phonemes of the initial search query.

5. The method of any one of claims 1-4, wherein determining the expanded search query involves a selecting process of determining candidate search query terms and selecting candidate search query terms to produce a refined search term list.

6. The method of claim 5, wherein the selecting process involves selecting candidate search query terms according to at least one criterion selected from a list of criteria consisting of: a limit of sub-word unit insertions; a limit of sub-word unit deletions; a limit of sub-word unit substitutions; exclusion of words not found in a reference dictionary; a phonetic similarity cost threshold; a semantic similarity cost threshold; and user input.

7. The method of claim 6, wherein the selecting process involves at least one type of user input selected from a group of user input types consisting of: a vocabulary size; an indication of an importance of phonetic similarity; an indication of an importance of semantic similarity; and an indication of a relative importance of reducing a miss rate and reducing a false positive rate.

8. The method of any one of claims 1-7, wherein the search involves a word unit search and a sub-word unit search.

9. The method of claim 8, further comprising:

determining a word unit search score and a sub-word unit search score; and

combining the word unit search score and the sub-word unit search score to determine a total search score.

10. The method of any one of claims 1-9, wherein analyzing the initial search query according to phonetic similarity involves applying a phonetic confusion model that is based, at least in part, on a phoneme and syllable cost matrix and a sub-word unit to word unit transducer.

11. The method of any one of claims 1-10, wherein analyzing the initial search query according to semantic similarity involves applying a semantic confusion model that is based, at least in part, on a conference vocabulary, a lexical database for a language or both the conference vocabulary and the lexical database.

12. The method of any one of claims 1-11, further comprising returning search results corresponding to the search.

13. The method of claim 12, further comprising:

selecting, from the audio recording, playback audio data comprising one or more instances of speech that include the search results; and

providing the playback audio data for playback on a speaker system.

14. The method of claim 13, further comprising scheduling at least a portion of the instances of speech for simultaneous playback.

15. The method of claim 13 or claim 14, wherein the audio recording includes at least a portion of a recording of a conference involving a plurality of conference participants, further comprising receiving an indication of a selected conference participant chosen by a user from among the plurality of conference participants, wherein selecting the playback audio data involves selecting one or more instances of speech of the conference recording that include speech by the selected conference participant that include the search results.

Description:
SEARCHING THE RESULTS OF AN AUTOMATIC SPEECH RECOGNITION PROCESS

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] The present application claims the benefit of International Patent Application No. PCT/CN2015/085570 filed on 30 July 2015, and United States Provisional Patent Application No. 62/210,537 filed on 27 August 2015, each of which is hereby incorporated by reference in its entirety.

TECHNICAL FIELD

[0002] This disclosure relates to searching the results of an automatic speech recognition (ASR) process. In particular, this disclosure relates to searching the results of an ASR process that has been performed on a recording of a conference, such as a teleconference or a video conference. Some aspects of this disclosure involve playing back such search results.

BACKGROUND

[0003] Conference recordings typically include a large amount of audio data, which may include a substantial amount of babble and non-substantive discussion. Locating relevant meeting topics via audio playback can be very time-consuming. ASR has sometimes been used to convert meeting recordings to text to enable text-based search and browsing.

[0004] Unfortunately, accurate meeting transcription based on automatic speech recognition has proven to be a challenging task. For example, the leading benchmark from the National Institute of Standards and Technology (NIST) has shown that although the word error rate (WER) for ASR of various types of speech has declined substantially in recent decades, the WER for meeting speech has remained substantially higher than the WER for other types of speech. According to a NIST report published in 2007, the WER for meeting speech was typically more than 25%, and frequently more than 50%, for meetings involving multiple conference participants. (Fiscus, Jonathan G., et al., "The Rich Transcription 2007 Meeting Recognition Evaluation" (NIST 2007).)

SUMMARY

[0005] According to some implementations disclosed herein, a method may involve receiving speech recognition results data for at least a portion of an audio recording and receiving an initial search query including at least one search word. Some methods may involve analyzing the initial search query according to phonetic similarity and semantic similarity and determining an expanded search query according to the phonetic similarity, the semantic similarity, or both the phonetic similarity and the semantic similarity. Some implementations may involve performing a search of the speech recognition results data according to the expanded search query.
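
By way of illustration only, the following Python sketch outlines the flow just described: receive speech recognition results and an initial query, expand the query along phonetic and semantic dimensions, and search the results. The function names and the expander/searcher callables are assumptions introduced for the sketch, not elements of the disclosure.

```python
# Minimal sketch of the disclosed flow: receive ASR results and an initial
# query, expand the query by phonetic and semantic similarity, then search.
# All names here are illustrative, not taken from the disclosure.

def process_search(asr_results, initial_query,
                   phonetic_expander, semantic_expander, searcher):
    """Return search hits over ASR results for an expanded query."""
    # Analyze the initial query along both similarity dimensions.
    phonetic_terms = phonetic_expander(initial_query)   # e.g., near-homophones
    semantic_terms = semantic_expander(initial_query)   # e.g., synonyms

    # The expanded query may draw on either or both sets of candidates.
    expanded_query = set(initial_query.split()) | phonetic_terms | semantic_terms

    # Perform the search of the speech recognition results data.
    return searcher(asr_results, expanded_query)
```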

[0006] In some examples, the audio recording may include at least a portion of a recording of a conference involving a plurality of conference participants. In some such examples, the speech recognition results data may include a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices. The word recognition confidence score may correspond with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference.
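
The sketch below shows one hypothetical way to represent such speech recognition results data: hypothesized words carrying recognition confidence scores, grouped into simplified lattices. A real lattice would also carry alternative paths and arc structure; the flat list used here, and all field names, are assumptions.

```python
# Hypothetical representation of speech recognition results data as lattices
# whose hypothesized words carry recognition confidence scores (a likelihood
# that the hypothesis matches what a participant actually said).
from dataclasses import dataclass
from typing import List

@dataclass
class HypothesizedWord:
    word: str
    start_time: float        # seconds into the conference recording
    end_time: float
    confidence: float        # word recognition confidence score, 0.0-1.0

@dataclass
class SpeechRecognitionLattice:
    endpoint_id: str          # which uplink stream the lattice was built from
    hypotheses: List[HypothesizedWord]

def find_word(lattices, query_word, min_confidence=0.5):
    """Return hypotheses matching query_word above a confidence threshold."""
    return [h for lat in lattices for h in lat.hypotheses
            if h.word == query_word and h.confidence >= min_confidence]
```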

[0007] According to some implementations, analyzing the initial search query may involve analyzing syllables and phonemes of the initial search query. In some examples, determining the expanded search query may involve a selecting process of determining candidate search query terms and selecting candidate search query terms to produce a refined search term list.

[0008] According to some examples, the selecting process may involve selecting candidate search query terms according to one or more of the following: a limit of sub-word unit insertions; a limit of sub-word unit deletions; a limit of sub-word unit substitutions; exclusion of words not found in a reference dictionary; a phonetic similarity cost threshold; a semantic similarity cost threshold; and user input. According to some such examples, the selecting process may involve one or more of the following types of user input: a vocabulary size; an indication of an importance of phonetic similarity; an indication of an importance of semantic similarity; and an indication of a relative importance of reducing a miss rate and reducing a false positive rate.
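
A minimal sketch of such a selecting process follows, assuming candidates arrive with phoneme sequences and precomputed phonetic and semantic costs. The threshold values and the use of Python's difflib to count sub-word unit insertions, deletions and substitutions are illustrative choices, not the disclosed implementation.

```python
# Illustrative selection of candidate search query terms using criteria of
# the kind listed above (edit limits on sub-word units, dictionary lookup,
# cost thresholds). Thresholds and candidate format are assumptions.
from difflib import SequenceMatcher

def subword_edits(query_units, candidate_units):
    """Count sub-word unit insertions, deletions and substitutions."""
    ins = dels = subs = 0
    for op, i1, i2, j1, j2 in SequenceMatcher(
            None, query_units, candidate_units).get_opcodes():
        if op == "insert":
            ins += j2 - j1
        elif op == "delete":
            dels += i2 - i1
        elif op == "replace":
            subs += max(i2 - i1, j2 - j1)
    return ins, dels, subs

def select_candidates(query_units, candidates, dictionary,
                      max_ins=1, max_dels=1, max_subs=2,
                      max_phonetic_cost=0.6, max_semantic_cost=0.6):
    """Reduce candidate terms to a refined search term list."""
    refined = []
    for word, units, phon_cost, sem_cost in candidates:
        ins, dels, subs = subword_edits(query_units, units)
        if (word in dictionary and ins <= max_ins and dels <= max_dels
                and subs <= max_subs and phon_cost <= max_phonetic_cost
                and sem_cost <= max_semantic_cost):
            refined.append(word)
    return refined
```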

[0009] In some implementations, the search may involve a word unit search and a sub-word unit search. Some such methods may involve determining a word unit search score and a sub-word unit search score and combining the word unit search score and the sub-word unit search score to determine a total search score.
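
One simple way to combine the two scores, shown below, is a weighted sum; the weighting is an assumption, as the disclosure does not prescribe a particular combination rule.

```python
# Combine word unit and sub-word unit search scores into a total search
# score; the 0.7/0.3 weighting is purely illustrative.
def total_search_score(word_score, subword_score, word_weight=0.7):
    """Weighted combination of the two search scores."""
    return word_weight * word_score + (1.0 - word_weight) * subword_score

print(total_search_score(0.8, 0.4))  # 0.68
```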

[0010] According to some implementations, analyzing the initial search query according to phonetic similarity may involve applying a phonetic confusion model that is based, at least in part, on a phoneme and syllable cost matrix, on a sub-word unit to word unit transducer, or on both the phoneme and syllable cost matrix and the sub-word unit to word unit transducer. In some examples, analyzing the initial search query according to semantic similarity may involve applying a semantic confusion model that is based, at least in part, on a conference vocabulary, a lexical database for a language or both the conference vocabulary and the lexical database.
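
The sketch below illustrates how a phoneme cost matrix of the kind mentioned above could feed a phonetic similarity score via a weighted edit distance. The cost values are invented, letters stand in for phonemes, and the sub-word unit to word unit transducer and the semantic components are omitted.

```python
# Weighted edit distance over phoneme sequences driven by a confusion cost
# matrix; a sketch only, with invented costs and letters standing in for
# phonemes.
def phonetic_cost(query_phones, cand_phones, cost_matrix, indel_cost=1.0):
    """cost_matrix[(p, q)] gives the cost of confusing phoneme p with q."""
    m, n = len(query_phones), len(cand_phones)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, n + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            p, q = query_phones[i - 1], cand_phones[j - 1]
            sub = cost_matrix.get((p, q), 0.0 if p == q else 1.0)
            dp[i][j] = min(dp[i - 1][j] + indel_cost,      # deletion
                           dp[i][j - 1] + indel_cost,      # insertion
                           dp[i - 1][j - 1] + sub)         # substitution
    return dp[m][n]

# Example with a cheap b/p confusion: "batter" is phonetically close to "patter".
costs = {("b", "p"): 0.3, ("p", "b"): 0.3}
print(phonetic_cost(list("batter"), list("patter"), costs))  # 0.3
```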

[0011] Some methods may involve returning search results corresponding to the search. Some such methods may involve selecting, from the audio recording, playback audio data comprising one or more instances of speech that include the search results. Such methods may involve providing the playback audio data for playback on a speaker system. Some such methods may involve scheduling at least a portion of the instances of speech for simultaneous playback.
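
A possible selection of playback audio data is sketched below: the spans of the recording containing the matched instances of speech are cut out, with a little surrounding context. The padding value and the clip format are assumptions.

```python
# Cut playback clips out of the audio recording around the matched instances
# of speech; padding and data layout are illustrative assumptions.
def select_playback_clips(audio, sample_rate, hits, padding=0.5):
    """`audio` is a 1-D sequence of samples; `hits` is a list of
    (start_time, end_time) pairs in seconds for matched speech instances."""
    clips = []
    for start, end in hits:
        first = max(0, int((start - padding) * sample_rate))
        last = min(len(audio), int((end + padding) * sample_rate))
        clips.append(audio[first:last])
    return clips
```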

[0012] In some examples, the audio recording may include at least a portion of a recording of a conference involving a plurality of conference participants. Some such methods may involve receiving an indication of a selected conference participant chosen by a user from among the plurality of conference participants. Selecting the playback audio data may involve selecting one or more instances of speech of the conference recording that include speech by the selected conference participant that includes the search results.

[0013] At least some aspects of the present disclosure may be implemented via apparatus. For example, one or more devices may be capable of performing, at least in part, the methods disclosed herein. In some implementations, an apparatus may include an interface system and a control system. The interface system may include a network interface, an interface between the control system and a memory system, an interface between the control system and another device and/or an external device interface. The control system may include at least one of a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components.

[0014] The control system may be capable of performing, at least in part, the methods disclosed herein. In some implementations, the control system may be capable of receiving, via the interface system, speech recognition results data for at least a portion of an audio recording, and of receiving an initial search query including at least one search word. The control system may be capable of analyzing the initial search query according to phonetic similarity and semantic similarity and of determining an expanded search query according to the phonetic similarity, the semantic similarity, or both the phonetic similarity and the semantic similarity. In some examples, the control system may be capable of performing a search of the speech recognition results data according to the expanded search query.

[0015] In some examples, the audio recording may include at least a portion of a recording of a conference involving a plurality of conference participants. According to some such examples, the speech recognition results data may include a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices. The word recognition confidence score may correspond with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference.

[0016] According to some implementations, analyzing the initial search query may involve analyzing syllables and phonemes of the initial search query. In some examples, determining the expanded search query may involve a selecting process of determining candidate search query terms and selecting candidate search query terms to produce a refined search term list.

[0017] Some or all of the methods described herein may be performed by one or more devices according to instructions (e.g., software) stored on non-transitory media. Such non-transitory media may include memory devices such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. Accordingly, various innovative aspects of the subject matter described in this disclosure can be implemented in a non-transitory medium having software stored thereon. The software may, for example, include instructions for controlling at least one device to process audio data. The software may, for example, be executable by one or more components of a control system such as those disclosed herein.

[0018] According to some examples, the software may include instructions for receiving speech recognition results data for at least a portion of an audio recording, for receiving an initial search query including at least one search word, for analyzing the initial search query according to phonetic similarity and semantic similarity and for determining an expanded search query according to the phonetic similarity, the semantic similarity, or both the phonetic similarity and the semantic similarity. In some such examples, the software may include instructions for performing a search of the speech recognition results data according to the expanded search query.

[0019] In some examples, the audio recording may include at least a portion of a recording of a conference involving a plurality of conference participants. According to some such examples, the speech recognition results data may include a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices. The word recognition confidence score may correspond with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference.

[0020] According to some implementations, analyzing the initial search query may involve analyzing syllables and phonemes of the initial search query. In some examples, determining the expanded search query may involve a selecting process of determining candidate search query terms and selecting candidate search query terms to produce a refined search term list.

[0021] Details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages will become apparent from the description, the drawings, and the claims. Note that the relative dimensions of the following figures may not be drawn to scale.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] Figure 1A shows examples of components of a teleconferencing system.

[0023] Figure 1B is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.

[0024] Figure 1C is a flow diagram that outlines one example of a method that may be performed by the apparatus of Figure 1B.

[0025] Figure 2A shows additional examples of components of a teleconferencing system.

[0026] Figure 2B shows examples of packet trace files and conference metadata.

[0027] Figure 3A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure.

[0028] Figure 3B is a flow diagram that outlines one example of a method that may be performed by the apparatus of Figure 3A.

[0029] Figure 3C shows additional examples of components of a teleconferencing system.

[0030] Figure 4 shows examples of components of an uplink analysis module.

[0031] Figure 5 shows examples of components of a joint analysis module.

[0032] Figure 6 shows examples of components of a playback system and associated equipment.

[0033] Figure 7 shows an example of an in-person conference implementation.

[0034] Figure 8 is a flow diagram that outlines one example of a method according to some implementations of this disclosure.

[0035] Figure 9 is a block diagram that shows examples of modules that may be used to perform some of the methods disclosed herein.

[0036] Figure 10 is a block diagram that shows examples of inputs for a phonetic confusion model and a semantic confusion model.

[0037] Figure 11 provides an example of determining costs for a phoneme and syllable cost matrix.

[0038] Figure 12 shows an example of a small phoneme confusion matrix and a WFST.

[0039] Figure 13 shows an example of phoneme-to-word generation from a conference pronunciation dictionary.

[0040] Figure 14 shows an example of a simple synonym-to-word (N2W) process.

[0041] Figure 15 is a block diagram that shows more detailed examples of elements that may be involved in a conference keyword search process.

[0042] Figure 16 is a flow diagram that shows an example of determining an expanded search query for a word unit search.

[0043] Figure 17 is a flow diagram that shows an example of determining an expanded search query for a sub-word unit search.

[0044] Figure 18 shows an example of merging search results from different sources.

[0045] Figure 19 shows an example of a graphical user interface (GUI) for receiving user input regarding search parameters.

[0046] Figure 20 is a flow diagram that outlines blocks of some topic analysis methods disclosed herein.

[0047] Figure 21 shows examples of topic analysis module elements.

[0048] Figure 22 shows an example of an input speech recognition lattice.

[0049] Figure 23, which includes Figures 23A and 23B, shows an example of a portion of a small speech recognition lattice after pruning.

[0050] Figure 24, which includes Figures 24A and 24B, shows an example of a user interface that includes a word cloud for an entire conference recording.

[0051] Figure 25, which includes Figures 25A and 25B, shows an example of a user interface that includes a word cloud for each of a plurality of conference segments.

[0052] Figure 26 is a flow diagram that outlines blocks of some playback control methods disclosed herein.

[0053] Figure 27 shows an example of selecting a topic from a word cloud.

[0054] Figure 28 shows an example of selecting both a topic from a word cloud and a conference participant from a list of conference participants.

[0055] Like reference numbers and designations in the various drawings indicate like elements.

DESCRIPTION OF EXAMPLE EMBODIMENTS

[0056] The following description is directed to certain implementations for the purposes of describing some innovative aspects of this disclosure, as well as examples of contexts in which these innovative aspects may be implemented. However, the teachings herein can be applied in various different ways. For example, while various implementations are described in terms of particular examples of audio data processing in the teleconferencing context, the teachings herein are widely applicable to other known audio data processing contexts, such as processing audio data corresponding to in-person conferences. Such conferences may, for example, include academic and/or professional conferences, stock broker calls, doctor/client visits, monologues, such as monologues for personal diarization (e.g., via a portable recording device such as a wearable recording device), etc.

[0057] Moreover, the described embodiments may be implemented in a variety of hardware, software, firmware, etc. For example, aspects of the present application may be embodied, at least in part, in an apparatus (a teleconferencing bridge and/or server, an analysis system, a playback system, a personal computer, such as a desktop, laptop, or tablet computer, a telephone, such as a desktop telephone, a smart phone or other cellular telephone, a television set-top box, a digital media player, etc.), a method, a computer program product, in a system that includes more than one apparatus (including but not limited to a teleconferencing system), etc. Accordingly, aspects of the present application may take the form of a hardware embodiment, a software embodiment (including firmware, resident software, microcodes, etc.) and/or an embodiment combining both software and hardware aspects. Such embodiments may be referred to herein as a "circuit," a "module" or "engine." Some aspects of the present application may take the form of a computer program product embodied in one or more non-transitory media having computer readable program code embodied thereon. Such non-transitory media may, for example, include a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. Accordingly, the teachings of this disclosure are not intended to be limited to the implementations shown in the figures and/or described herein, but instead have wide applicability.

[0058] Some aspects of the present disclosure involve the recording, processing and playback of audio data corresponding to conferences, such as teleconferences. In some teleconference implementations, the audio experience heard when a recording of the conference is played back may be substantially different from the audio experience of an individual conference participant during the original teleconference. In some implementations, the recorded audio data may include at least some audio data that was not available during the teleconference. In some examples, the spatial and/or temporal characteristics of the played-back audio data may be different from that of the audio heard by participants of the teleconference.

[0059] Figure 1A shows examples of components of a teleconferencing system. The components of the teleconferencing system 100 may be implemented via hardware, via software stored on non-transitory media, via firmware and/or by combinations thereof. The types and numbers of components shown in Figure 1A are merely shown by way of example. Alternative implementations may include more, fewer and/or different components.

[0060] In this example, the teleconferencing system 100 includes a teleconferencing apparatus 200 that is capable of providing the functionality of a teleconferencing server according to a packet-based protocol, which in this implementation is VoIP (Voice over Internet Protocol). At least some of the telephone endpoints 1 may include features that allow conference participants to use a software application running on a desktop or laptop computer, a smartphone, a dedicated VoIP telephone device or another such device to act as a telephony client, connecting to the teleconferencing server over the Internet.

[0061] However, some of the telephone endpoints 1 may not include such features. Accordingly, the teleconferencing system 100 may provide access via the PSTN (Public Switched Telephone Network), e.g., in the form of a bridge that transforms the traditional telephony streams from the PSTN into VoIP data packet streams.

[0062] In some implementations, during a teleconference the teleconferencing apparatus 200 receives a plurality of individual uplink data packet streams 7 and transmits a plurality of individual downlink data packet streams 8 to and from a plurality of telephone endpoints 1. The telephone endpoints 1 may include telephones, personal computers, mobile electronic devices (e.g., cellular telephones, smart phones, tablets, etc.) or other appropriate devices. Some of the telephone endpoints 1 may include headsets, such as stereophonic headsets. Other telephone endpoints 1 may include a traditional telephone handset. Still other telephone endpoints 1 may include teleconferencing speaker phones, which may be used by multiple conference participants. Accordingly, the individual uplink data packet streams 7 received from some such telephone endpoints 1 may include teleconference audio data from multiple conference participants.

[0063] In this example, one of the telephone endpoints includes a teleconference recording module 2. Accordingly, the teleconference recording module 2 receives a downlink data packet stream 8 but does not transmit an uplink data packet stream 7. Although shown as a separate apparatus in Figure 1A, the teleconference recording module 2 may be implemented as hardware, software and/or firmware. In some examples, the teleconference recording module 2 may be implemented via hardware, software and/or firmware of a teleconferencing server. However, the teleconference recording module 2 is purely optional. Other implementations of the teleconferencing system 100 do not include the teleconference recording module 2.

[0064] Voice transmission over packet networks is subject to delay variation, commonly known as jitter. Jitter may, for example, be measured in terms of inter-arrival time (IAT) variation or packet delay variation (PDV). IAT variation may be measured according to the receive time difference of adjacent packets. PDV may, for example, be measured by reference to time intervals from a datum or "anchor" packet receive time. In Internet Protocol (IP)-based networks, a fixed delay can be attributed to algorithmic, processing and propagation delays due to material and/or distance, whereas a variable delay may be caused by the fluctuation of IP network traffic, different transmission paths over the Internet, etc.
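
For concreteness, the two jitter measures mentioned above could be computed from packet timing as in the following sketch; the choice of the first packet as the PDV anchor is an assumption.

```python
# Illustrative computation of inter-arrival time (IAT) variation and packet
# delay variation (PDV) from packet timing data, in seconds.
def inter_arrival_variation(receive_times):
    """IAT differences between adjacent packets."""
    return [t2 - t1 for t1, t2 in zip(receive_times, receive_times[1:])]

def packet_delay_variation(receive_times, send_times):
    """PDV relative to the first (anchor) packet's one-way delay."""
    anchor_delay = receive_times[0] - send_times[0]
    return [(r - s) - anchor_delay for r, s in zip(receive_times, send_times)]
```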

[0065] Teleconferencing servers generally rely on a "jitter buffer" to counter the negative impact of jitter. By introducing an additional delay between the time a packet of audio data is received and the time that the packet is reproduced, a jitter buffer can transform an uneven flow of arriving packets into a more regular flow of packets, such that delay variations will not cause perceptual sound quality degradation to the end users. However, voice communication is highly delay-sensitive. According to ITU Recommendation G.114, for example, one-way delay (sometimes referred to herein as a "mouth-to-ear latency time threshold") should be kept below 150 milliseconds (ms) for normal conversation, with above 400 ms being considered unacceptable. Typical latency targets for teleconferencing are lower than 150 ms, e.g., 100 ms or below.
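
The following is a minimal sketch of a fixed-delay jitter buffer of the kind described: each packet is scheduled for playout at its nominal stream time plus a fixed buffering delay, so an uneven arrival flow becomes a regular playout flow. The 60 ms delay, and the assumption that the playout clock is aligned with the stream timeline, are illustrative.

```python
# Minimal fixed-delay jitter buffer sketch; not a production implementation.
import heapq

class JitterBuffer:
    def __init__(self, buffer_delay=0.060):
        self.buffer_delay = buffer_delay
        self._heap = []                      # (playout_time, sequence, payload)

    def push(self, stream_time, sequence, payload):
        # stream_time: the packet's position in the media timeline (seconds).
        heapq.heappush(self._heap,
                       (stream_time + self.buffer_delay, sequence, payload))

    def pop_due(self, now):
        """Return payloads whose scheduled playout time has arrived."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[2])
        return due
```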

[0066] The low latency requirement may place an upper limit on how long the teleconferencing apparatus 200 may wait for an expected uplink data packet to arrive without annoying conference participants. Uplink data packets that arrive too late for reproduction during a teleconference will not be provided to the telephone endpoints 1 or the teleconference recording module 2. Instead, the corresponding downlink data packet streams 8 will be provided to the telephone endpoints 1 and the teleconference recording module 2 with missing or late data packets dropped. In the context of this disclosure, a "late" data packet is a data packet that arrived too late to be provided to the telephone endpoints 1 or the teleconference recording module 2 during a teleconference.

[0067] However, in various implementations disclosed herein, the teleconferencing apparatus 200 may be capable of recording more complete uplink data packet streams 7. In some implementations, the teleconferencing apparatus 200 may be capable of including late data packets in the recorded uplink data packet streams 7 that were received after a mouth- to-ear latency time threshold of the teleconference and therefore were not used for reproducing audio data to conference participants during the teleconference. In some such implementations, the teleconferencing apparatus 200 may be capable of determining that a late data packet of an incomplete uplink data packet stream has not been received from a telephone endpoint within a late packet time threshold. The late packet time threshold may be greater than or equal to a mouth-to-ear latency time threshold of the teleconference. For example, in some implementations the late packet time threshold may be greater than or equal to 200 ms, 400 ms, 500 ms, 1 second or more.

[0068] In some examples, the teleconferencing apparatus 200 may be capable of determining that a data packet of an incomplete uplink data packet stream has not been received from a telephone endpoint within a missing packet time threshold, greater than the late packet time threshold. In some such examples, the teleconferencing apparatus 200 may be capable of transmitting a request, to the telephone endpoint, to re-send a missing data packet. Like the late data packets, the missing data packets would not have been recorded by the teleconference recording module 2. The missing packet time threshold may, in some implementations, be hundreds of milliseconds or even several seconds, e.g., 5 seconds, 10 seconds, 20 seconds, 30 seconds, etc. In some implementations, the missing packet time threshold may be one minute or longer, e.g., 2 minutes, 3 minutes, 4 minutes, 5 minutes, etc.

[0069] In this example, the teleconferencing apparatus 200 is capable of recording the individual uplink data packet streams 7 and providing them to the conference recording database 3 as individual uplink data packet streams. The conference recording database 3 may be stored in one or more storage systems, which may or may not be in the same location as the teleconferencing apparatus 200, depending on the particular implementation. Accordingly, in some implementations the individual uplink data packet streams that are recorded by the teleconferencing apparatus 200 and stored in the conference recording database 3 may be more complete than the data packet streams available during the teleconference.

[0070] In the implementation shown in Figure 1A, the analysis engine 307 is capable of analyzing and processing the recorded uplink data packet streams to prepare them for playback. In this example, the analysis results from the analysis engine 307 are stored in the analysis results database 5, ready for playback by the playback system 609. In some examples, the playback system 609 may include a playback server, which may be capable of streaming analysis results over a network 12 (e.g., the Internet). In Figure 1A, the playback system 609 is shown streaming analysis results to a plurality of listening stations 11 (each of which may include one or more playback software applications running on a local device, such as a computer). Here, one of the listening stations 11 includes headphones 607 and the other listening station 11 includes a speaker array 608.

[0071] As noted above, due to latency issues the playback system 609 may have a more complete set of data packets available for reproduction than were available during the teleconference. In some implementations, there may be other differences and/or additional differences between the teleconference audio data reproduced by the playback system 609 and the teleconference audio data available for reproduction during the teleconference. For example, a teleconferencing system generally limits the data rates for uplink and downlink data packets to a rate that can be reliably maintained by the network. Furthermore, there is often a financial incentive to keep the data rate down, because the teleconference service provider may need to provision more expensive network resources if the combined data rate of the system is too high.

[0072] In addition to data rate constraints, there may be practical constraints on the number of IP packets that can be reliably handled each second by network components such as switches and routers, and also by software components such as the TCP/IP stack in the kernel of a teleconferencing server's host operating system. Such constraints may have implications for how the data packet streams corresponding to teleconferencing audio data are encoded and partitioned into IP packets.

[0073] A teleconferencing server needs to process data packets and perform mixing operations, etc., quickly enough to avoid perceptual quality degradation to conference participants, and generally must do so with an upper bound on computational resources. The smaller the computational overhead that is required to service a single conference participant, the larger the number of conference participants that can be handled in real time by a single piece of server equipment. Therefore keeping the computational overhead relatively small provides economic benefits to teleconference service providers.

[0074] Most teleconference systems are so-called "reservationless" systems. This means that the teleconferencing server does not "know" ahead of time how many teleconferences it will be expected to host at once, or how many conference participants will connect to any given teleconference. At any time during a teleconference, the server has neither an indication of how many additional conference participants may subsequently join the teleconference nor an indication of how many of the current conference participants may leave the teleconference early.

[0075] Moreover, a teleconferencing server will generally not have meeting dynamics information prior to a teleconference regarding what kind of human interaction is expected to occur during the teleconference. For example, it will not be known in advance whether one or more conference participants will dominate the conversation, and if so, which conference participant(s). At any instant in time, the teleconferencing server must decide what audio to provide in each downlink data packet stream based only on what has occurred in the teleconference until that instant.

[0076] However, the foregoing set of constraints will generally not apply when the analysis engine 307 processes the individual uplink data packet streams that are stored in the conference recording database 3. Similarly, the foregoing set of constraints will generally not apply when the playback system 609 is processing and reproducing data from the analysis results database 5, which has been output from the analysis engine 307.

[0077] For example, assuming that analysis and playback occur after the teleconference is complete, the playback system 609 and/or the analysis engine 307 may use information from the entire teleconference recording in order to determine how best to process, mix and/or render any instant of the teleconference for reproduction during playback. Even if the teleconference recording only corresponds to a portion of the teleconference, data corresponding to that entire portion will be available for determining how optimally to mix, render and otherwise process the recorded teleconference audio data (and possibly other data, such as teleconference metadata) for reproduction during playback.

[0078] In many implementations, the playback system 609 may be providing audio data, etc., to a listener who is not trying to interact with those in the teleconference. Accordingly, the playback system 609 and/or the analysis engine 307 may have seconds, minutes, hours, days, or even a longer time period in which to analyze and/or process the recorded teleconference audio data and make the teleconference available for playback. This means that computationally-heavy and/or data-heavy algorithms, which can only be performed slower than real time on the available hardware, may be used by the analysis engine 307 and/or the playback system 609. Due to these relaxed time constraints, some implementations may involve queueing up teleconference recordings for analysis and analyzing them when resources permit (e.g., when analysis of previously-recorded teleconferences is complete or at "off-peak" times of day when electricity or cloud computing resources are less expensive or more readily available).

[0079] Assuming that analysis and playback occur after a teleconference is complete, the analysis engine 307 and the playback system 609 can have access to a complete set of teleconference participation information, e.g., information regarding which conference participants were involved in the teleconference and the times at which each conference participant joined and left the teleconference. Similarly, assuming that analysis and playback occur after the teleconference is complete, the analysis engine 307 and the playback system 609 can have access to a complete set of teleconference audio data and any associated metadata from which to determine (or at least to estimate) when each participant spoke. This task may be referred to herein as "speaker diarization." Based on speaker diarization information, the analysis engine 307 can determine conversational dynamics data such as which conference participant(s) spoke the most, who spoke to whom, who interrupted whom, how much doubletalk (times during which at least two conference participants are speaking simultaneously) occurred during the teleconference, and potentially other useful information which the analysis engine 307 and/or the playback system 609 can use in order to determine how best to mix and render the conference during playback. Even if the teleconference recording only corresponds to a portion of the teleconference, data corresponding to that entire portion will be available for determining teleconference participation information, conversational dynamics data, etc.

[0080] The present disclosure includes methods and devices for recording, analyzing and playing back teleconference audio data such that the teleconference audio data presented during playback may be substantially different from what would have been heard by conference participants during the original teleconference and/or what would have been recorded during the original teleconference by a recording device such as the teleconference recording module 2 shown in Figure 1A. Various implementations disclosed herein make use of one or more of the above-identified constraint differences between the live teleconference and the playback use-cases to produce a better user experience during playback. Without loss of generality, we now discuss a number of specific implementations and particular methods for recording, analyzing and playing back teleconference audio data such that the playback can be advantageously different from the original teleconference experience.

[0081] Figure 1B is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure. The types and numbers of components shown in Figure 1B are merely shown by way of example. Alternative implementations may include more, fewer and/or different components. The apparatus 10 may, for example, be an instance of a teleconferencing apparatus 200. In some examples, the apparatus 10 may be a component of another device. For example, in some implementations the apparatus 10 may be a component of a teleconferencing apparatus 200, e.g., a line card.

[0082] In this example, the apparatus 10 includes an interface system 105 and a control system 110. The interface system 105 may include one or more network interfaces, one or more interfaces between the control system 110 and a memory system and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). The control system 110 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components. In some implementations, the control system 110 may be capable of providing teleconference server functionality.

[0083] Figure 1C is a flow diagram that outlines one example of a method that may be performed by the apparatus of Figure 1B. The blocks of method 150, like other methods described herein, are not necessarily performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described.

[0084] In this implementation, block 155 involves receiving teleconference audio data during a teleconference, via an interface system. For example, the teleconference audio data may be received by the control system 110 via the interface system 105 in block 155. In this example, the teleconference audio data includes a plurality of individual uplink data packet streams, such as the uplink data packet streams 7 shown in Figure 1A. Accordingly, each uplink data packet stream corresponds to a telephone endpoint used by one or more conference participants.

[0085] In this example, block 160 involves sending to a memory system, via the interface system, the teleconference audio data as individual uplink data packet streams. Accordingly, instead of being recorded as mixed audio data received as one of the downlink data packet streams 8 shown in Figure 1A, such as the downlink data packet stream 8 that is recorded by the teleconference recording module 2, the packets received via each of the uplink data packet streams 7 are recorded and stored as individual uplink data packet streams.

[0086] However, in some examples at least one of the uplink data packet streams may correspond to multiple conference participants. For example, block 155 may involve receiving such an uplink data packet stream from a spatial speakerphone used by multiple conference participants. Accordingly, in some instances the corresponding uplink data packet stream may include spatial information regarding each of the multiple participants.

[0087] In some implementations, the individual uplink data packet streams received in block 155 may be individual encoded uplink data packet streams. In such implementations, block 160 may involve sending the teleconference audio data to the memory system as individual encoded uplink data packet streams.

[0088] As noted above, in some examples the interface system 105 may include a network interface. In some such examples, block 160 may involve sending the teleconference audio data to a memory system of another device via the network interface. However, in some implementations the apparatus 10 may include at least part of the memory system. The interface system 105 may include an interface between the control system and at least part of the memory system. In some such implementations, block 160 may involve sending the teleconference audio data to a memory system of the apparatus 10.

[0089] Due at least in part to the teleconferencing latency issues described above, at least one of the uplink data packet streams may include at least one data packet that was received after a mouth-to-ear latency time threshold of the teleconference and was therefore not used for reproducing audio data during the teleconference. The mouth-to-ear latency time threshold may differ from implementation to implementation, but in many implementations the mouth-to-ear latency time threshold may be 150 ms or less. In some examples, the mouth-to-ear latency time threshold may be greater than or equal to 100 ms.

[0090] In some implementations, the control system 110 may be capable of determining that a late data packet of an incomplete uplink data packet stream has not been received from a telephone endpoint within a late packet time threshold. In some implementations, the late packet time threshold may be greater than or equal to a mouth-to-ear latency time threshold of the teleconference. For example, in some implementations the late packet time threshold may be greater than or equal to 200 ms, 400 ms, 500 ms, 1 second or more. In some examples, the control system 110 may be capable of determining that a data packet of an incomplete uplink data packet stream has not been received from a telephone endpoint within a missing packet time threshold, greater than the late packet time threshold. In some implementations, the control system 110 may be capable of transmitting a request to the telephone endpoint, via the interface system 105, to re-send the missing data packet. The control system 110 may be capable of receiving the missing data packet and of adding the missing data packet to the incomplete uplink data packet stream.
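
The late and missing packet time thresholds described above could be applied as in the following sketch, which classifies an expected but unreceived packet and, once the missing packet threshold is exceeded, asks the endpoint to re-send it. The threshold values and the request_resend callable are assumptions.

```python
# Classify a gap in an uplink stream against illustrative late/missing
# packet time thresholds; values and the resend helper are assumptions.
LATE_PACKET_THRESHOLD = 0.5      # seconds; >= mouth-to-ear latency threshold
MISSING_PACKET_THRESHOLD = 10.0  # seconds; > late packet threshold

def classify_gap(expected_time, now):
    """Classify a packet that has not yet arrived."""
    waited = now - expected_time
    if waited >= MISSING_PACKET_THRESHOLD:
        return "missing"   # request re-transmission from the endpoint
    if waited >= LATE_PACKET_THRESHOLD:
        return "late"      # too late for live playout, still recordable
    return "pending"

def handle_gap(endpoint, sequence, expected_time, now, request_resend):
    state = classify_gap(expected_time, now)
    if state == "missing":
        request_resend(endpoint, sequence)   # hypothetical transport call
    return state
```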

[0092] Figure 2A shows additional examples of components of a teleconferencing system. The types and numbers of components shown in Figure 2A are merely shown by way of example. Alternative implementations may include more, fewer and/or different components. In this example, the teleconferencing apparatus 200 includes a VoIP teleconferencing bridge. In this example, there are five telephone endpoints being used by the conference participants, including two headset endpoints 206, a spatial speakerphone endpoint 207, and two PSTN endpoints 208. The spatial speakerphone endpoint 207 may be capable of providing spatial information corresponding to positions of each of multiple conference participants. Here, a PSTN bridge 209 forms a gateway between an IP network and the PSTN endpoints 208, converting PSTN signals to IP data packet streams and vice versa.

[0093] In Figure 2A, uplink data packet streams 201A-205A, each corresponding to one of the five telephone endpoints, are being received by the teleconferencing apparatus 200. In some instances, there may be multiple conference participants participating in the teleconference via the spatial speakerphone endpoint 207. If so, the uplink data packet stream 203A may include audio data and spatial information for each of the multiple conference participants.

[0094] In some implementations, each of the uplink data packet streams 201A-205A may include a sequence number for each data packet, as well as a data packet payload. In some examples, each of the uplink data packet streams 201A-205A may include a talkspurt number corresponding with each talkspurt included in an uplink data packet stream. For example, each telephone endpoint (or a device associated with a telephone endpoint such as the PSTN bridge 209) may include a voice activity detector that is capable of detecting instances of speech and non-speech. The telephone endpoint or associated device may include a talkspurt number in one or more data packets of an uplink data packet stream corresponding with such instances of speech, and may increment the talkspurt number each time that the voice activity detector determines that speech has recommenced after a period of non-speech. In some implementations, the talkspurt number may be a single bit that toggles between 1 and 0 at the start of each talkspurt.
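
Talkspurt numbering of the kind described, using the single toggling bit mentioned above, might look like the following sketch; the frame format and the voice activity detector callable are assumptions.

```python
# Toggle a one-bit talkspurt number each time speech resumes after
# non-speech, as determined by an assumed voice activity detector.
def label_talkspurts(frames, is_speech):
    """Yield (frame, talkspurt_bit) pairs for speech frames only."""
    bit = 0
    previous_speech = False
    for frame in frames:
        speaking = is_speech(frame)
        if speaking and not previous_speech:
            bit ^= 1                      # new talkspurt begins
        if speaking:
            yield frame, bit
        previous_speech = speaking
```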

[0095] In this example, the teleconferencing apparatus 200 assigns a "receive" timestamp to each received uplink data packet. Here, the teleconferencing apparatus 200 sends packet trace files 201B-205B, each of which corresponds to one of the uplink data packet streams 201A-205A, to the conference recording database 3. In this implementation, the packet trace files 201B-205B include a receive timestamp for each received uplink data packet, as well as the received sequence number, talkspurt number and data packet payloads.

[0096] In this example, the teleconferencing apparatus 200 also sends conference metadata 210 to the conference recording database 3. The conference metadata 210 may, for example, include data regarding individual conference participants, such as conference participant name, conference participant location, etc. The conference metadata 210 may indicate associations between individual conference participants and one of the packet trace files 201B-205B. In some implementations, the packet trace files 201B-205B and the conference metadata 210 may together form one teleconference recording in the conference recording database 3.

[0097] Figure 2B shows examples of packet trace files and conference metadata. In this example, the conference metadata 210 and the packet trace files 201B-204B have data structures that are represented as tables that include four columns, also referred to herein as fields. The particular data structures shown in Figure 2B are merely shown by way of example; other examples may include more or fewer fields. As described elsewhere herein, in some implementations the conference metadata 210 may include other types of information that are not shown in Figure 2B.

[0098] In this example, the conference metadata 210 data structure includes a conference participant name field 212, a connection time field 214 (indicating when the corresponding conference participants joined the conference), a disconnection time field 216 (indicating when the corresponding conference participants left the conference) and a packet trace file field 218. It may be seen in this example that the same conference participant may be listed multiple times in the conference metadata 210 data structure, once for every time he or she joins or rejoins the conference. The packet trace file field 218 includes information for identifying a corresponding packet trace file.

[0099] Accordingly, the conference metadata 210 provides a summary of some events of a conference, including who participated, for how long, etc. In some implementations, the conference metadata 210 may include other information, such as the endpoint type (e.g., headset, mobile device, speaker phone, etc.).

[00100] In this example, each of the packet trace files 201B-204B also includes four fields, each field corresponding to a different type of information. Here, each of the packet trace files 201B-204B includes a received time field 222, a sequence number field 224, a talkspurt identification field 226 and a payload data field 228. The sequence numbers and talkspurt numbers, which may be included in packet payloads, enable the payloads to be arranged in the correct order. In this example, each instance of payload data indicated by the payload data field 228 corresponds to the remainder of the payload of a packet after the sequence number and talkspurt number have been removed, including the audio data corresponding to the corresponding conference participant. Each of the packet trace files 201B-204B may, for example, contain the payload data of packets originating from an endpoint such as those shown in Figure 2A. One packet trace file may include payload data from a large number of packets.
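
The record layouts described above might be represented as in the following sketch; the field names are chosen to mirror the description and are not a disclosed schema.

```python
# Hypothetical record types mirroring the four fields described for the
# packet trace files and the conference metadata.
from dataclasses import dataclass

@dataclass
class PacketTraceRecord:
    received_time: float     # receive timestamp assigned by the server
    sequence_number: int
    talkspurt_id: int
    payload: bytes           # remaining payload, including the audio data

@dataclass
class ConferenceMetadataEntry:
    participant_name: str
    connection_time: float
    disconnection_time: float
    packet_trace_file: str   # identifies the corresponding trace file
```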

[00101] Although not shown in Figure 2B, the conference metadata 210 corresponds to a particular conference. Accordingly, the metadata and packet trace files 201B-204B for a conference, including the payload data, may be stored for later retrieval according to, e.g., a conference code.

[00102] The packet trace files 201B-204B and the conference metadata 210 may change over the duration of a conference, as more information is added. According to some implementations, such changes may happen locally, with the final packet trace files and the conference metadata 210 being sent to the conference recording database 3 after the conference has ended. Alternatively, or additionally, the packet trace files 201B-204B and/or the conference metadata 210 can be created, and then updated, on the conference recording database 3.

[00103] Figure 3A is a block diagram that shows examples of components of an apparatus capable of implementing various aspects of this disclosure. The types and numbers of components shown in Figure 3A are merely shown by way of example. Alternative implementations may include more, fewer and/or different components. The apparatus 300 may, for example, be an instance of an analysis engine 307. In some examples, the apparatus 300 may be another device, or may be a component of another device. For example, in some implementations the apparatus 300 may be a component of an analysis engine 307, e.g., an uplink analysis module described elsewhere herein.

[00104] In this example, the apparatus 300 includes an interface system 325 and a control system 330. The interface system 325 may include one or more network interfaces, one or more interfaces between the control system 330 and a memory system and/or one or more external device interfaces (such as one or more universal serial bus (USB) interfaces). In some implementations, the interface system 325 may include, or may be capable of providing, one or more user interfaces. In some examples, the control system may control at least a portion of the interface system 325 to provide a graphical user interface. The control system 330 may, for example, include a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, and/or discrete hardware components.

[00105] Figure 3B is a flow diagram that outlines one example of a method that may be performed by the apparatus of Figure 3A. The blocks of method 350, like other methods described herein, are not necessarily performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described.

[00106] In this implementation, block 355 involves receiving previously stored audio data, also referred to herein as recorded audio data, for a teleconference, via an interface system. For example, the recorded audio data may be received by the control system 330 via the interface system 325 in block 355. In this example, the recorded audio data includes at least one individual uplink data packet stream corresponding to a telephone endpoint used by one or more conference participants.

[00107] Here, the received individual uplink data packet stream includes timestamp data corresponding to data packets of the individual uplink data packet stream. As noted above, in some implementations a teleconferencing apparatus 200 may assign a receive timestamp to each received uplink data packet. A teleconferencing apparatus 200 may store, or may cause to be stored, time-stamped data packets in the order they were received by the teleconference server 200. Accordingly, in some implementations block 355 may involve receiving the recorded audio data, including the individual uplink data packet stream that includes timestamp data, from a conference recording database 3 such as that shown in Figure 1A, above.

[00108] In this example, block 360 involves analyzing timestamp data of data packets in the individual uplink data packet stream. Here, the analyzing process of block 360 involves determining whether the individual uplink data packet stream includes at least one out-of-order data packet. In this implementation, if the individual uplink data packet stream includes at least one out-of-order data packet, the individual uplink data packet stream will be re-ordered according to the timestamp data, in block 365.
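A minimal sketch of the analysis and re-ordering of blocks 360 and 365 is given below, assuming each packet is represented as a (timestamp, payload) pair; that representation and the function name are hypothetical.

```python
from typing import List, Tuple

Packet = Tuple[int, bytes]  # (receive timestamp in ms, packet payload); hypothetical layout

def reorder_if_needed(stream: List[Packet]) -> List[Packet]:
    """Re-order an individual uplink data packet stream by timestamp (blocks 360 and 365)."""
    timestamps = [ts for ts, _ in stream]
    # Block 360: determine whether any packet arrived out of order.
    out_of_order = any(a > b for a, b in zip(timestamps, timestamps[1:]))
    # Block 365: re-order according to the timestamp data only when necessary.
    return sorted(stream, key=lambda p: p[0]) if out_of_order else stream
```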

[00109] In some implementations, at least one data packet of the individual uplink data packet stream may have been received after a mouth-to-ear latency time threshold of the teleconference. If so, the individual uplink data packet stream includes data packets that would not have been available for inclusion in downlink data packet streams for reproduction to conference participants or for recording at a telephone endpoint. Data packets received after the mouth-to-ear latency time threshold may or may not have been received out of order, depending on the particular circumstance.

[00110] The control system 330 of Figure 3A may be capable of various other functionality. For example, the control system 330 may be capable of receiving, via the interface system 325, teleconference metadata and of indexing the individual uplink data packet stream based, at least in part, on the teleconference metadata.

[00111] The recorded audio data received by the control system 330 may include a plurality of individual encoded uplink data packet streams, each of the individual encoded uplink data packet streams corresponding to a telephone endpoint used by one or more conference participants. In some implementations, as described in more detail below, the control system 330 may include a joint analysis module capable of analyzing a plurality of individual uplink data packet streams. The joint analysis module may be capable of determining conversational dynamics data, such as data indicating the frequency and duration of conference participant speech, data indicating instances of conference participant doubletalk during which at least two conference participants are speaking simultaneously and/or data indicating instances of conference participant conversations.

[00112] The control system 330 may be capable of decoding each of the plurality of individual encoded uplink data packet streams. In some implementations, the control system 330 may be capable of providing one or more decoded uplink data packet streams to a speech recognition module capable of recognizing speech and generating speech recognition results data. The speech recognition module may be capable of providing the speech recognition results data to the joint analysis module. In some implementations, the joint analysis module may be capable of identifying keywords in the speech recognition results data and of indexing keyword locations.

[00113] In some implementations, the control system 330 may be capable of providing one or more decoded uplink data packet streams to a speaker diarization module. The speaker diarization module may be capable of identifying speech of each of multiple conference participants in an individual decoded uplink data packet stream. The speaker diarization module may be capable of generating a speaker diary indicating times at which each of the multiple conference participants were speaking and of providing the speaker diary to the joint analysis module. In some implementations, the control system 330 may be capable of providing a plurality of individual decoded uplink data packet streams to the joint analysis module.

[00114] Figure 3C shows additional examples of components of a teleconferencing system. The types and numbers of components shown in Figure 3C are merely shown by way of example. Alternative implementations may include more, fewer and/or different components. In this implementation, various files from a conference recording database 3 and information from a conference database 308 are being received by an analysis engine 307. The analysis engine 307 and its components may be implemented via hardware, via software stored on non-transitory media, via firmware and/or by combinations thereof. The information from the conference database 308 may, for example, include information regarding which conference recordings exist, regarding who has permission to listen to and/or modify each conference recording, regarding which conferences were scheduled and/or regarding who was invited to each conference, etc.

[00115] In this example, the analysis engine 307 is receiving packet trace files 201B-205B from the conference recording database 3, each of which corresponds to one of the uplink data packet streams 201A-205A that had previously been received by the teleconferencing apparatus 200. The packet trace files 201B-205B may, for example, include a receive timestamp for each received uplink data packet, as well as a received sequence number, talkspurt number and data packet payloads. In this example, each of the packet trace files 201B-205B is provided to a separate one of the uplink analysis modules 301-305 for processing. In some implementations, the uplink analysis modules 301-305 may be capable of re-ordering data packets of a packet trace file, e.g., as described above with reference to Figure 3B. Some additional examples of uplink analysis module functionality are described below with reference to Figure 4.

[00116] In this example, each of the uplink analysis modules 301-305 outputs a corresponding one of the per-uplink analysis results 301C-305C. In some implementations, the per-uplink analysis results 301C-305C may be used by the playback system 609 for playback and visualization. Some examples are described below with reference to Figure 6.

[00117] Here, each of the uplink analysis modules 301-305 also provides output to the joint analysis module 306. The joint analysis module 306 may be capable of analyzing data corresponding to a plurality of individual uplink data packet streams.

[00118] In some examples, the joint analysis module 306 may be capable of analyzing conversational dynamics and determining conversational dynamics data. These and other examples of joint analysis module functionality are described in more detail below with reference to Figure 5.

[00119] In this example, the joint analysis module 306 outputs meeting overview information 311, which may include the time of a conference, names of participants, etc. In some implementations, the meeting overview information 311 may include conversational dynamics data. Here, the joint analysis module 306 also outputs segment and word cloud data 309 and a search index 310, both of which are described below with reference to Figure 5.

[00120] Here, the analysis engine 307 is also receiving conference metadata 210. As noted elsewhere herein, the conference metadata 210 may include data regarding individual conference participants, such as conference participant name and/or conference participant location, associations between individual conference participants and one of the packet trace files 201B-205B, etc. In this example, the conference metadata 210 are provided to the joint analysis module 306.

[00121] Figure 4 shows examples of components of an uplink analysis module. The uplink analysis module 301 and its components may be implemented via hardware, via software stored on non-transitory media, via firmware and/or by combinations thereof. The types and numbers of components shown in Figure 4 are merely shown by way of example. Alternative implementations may include more, fewer and/or different components.

[00122] In this implementation, the uplink analysis module 301 is shown receiving the packet trace file 201B. Here, the packet trace file 201B, corresponding to an individual uplink data packet stream, is received and processed by the packet stream normalization module 402. In this example, the packet stream normalization module 402 is capable of analyzing sequence number data of data packets in the packet trace file 201B and determining whether the individual uplink data packet stream includes at least one out-of-order data packet. If the packet stream normalization module 402 determines that the individual uplink data packet stream includes at least one out-of-order data packet, in this example the packet stream normalization module 402 will re-order the individual uplink data packet stream according to the sequence numbers.

[00123] In this implementation, the packet stream normalization module 402 outputs an ordered playback stream 401B as one component of the uplink analysis results 301C output by the uplink analysis module 301. In some implementations, the packet stream normalization module 402 may include a playback timestamp and a data packet payload corresponding to each data packet of the ordered playback stream 401B. Here, the ordered playback stream 401B includes encoded data, but in alternative implementations the ordered playback stream 401B may include decoded data or transcoded data. In this example, the playback stream index 401A, output by the packet stream indexing module 403, is another component of the uplink analysis results 301C. The playback stream index 401A may facilitate random access playback by the playback system 609.

[00124] The packet stream indexing module 403 may, for example, determine instances of talkspurts of conference participants (e.g., according to talkspurt numbers of the input uplink packet trace) and include corresponding index information in the playback stream index 401A, in order to facilitate random access playback of the conference participant talkspurts by the playback system 609. In some implementations, the packet stream indexing module 403 may be capable of indexing according to time. For example, the packet stream indexing module 403 may be capable of forming a packet stream index that indicates the byte offset within the playback stream of the encoded audio for a corresponding playback time. In some such implementations, during playback the playback system 609 may look up a particular time in the packet stream index (for example, according to a time granularity, such as a 10-second granularity) and the packet stream index may indicate a byte offset within the playback stream of the encoded audio for that playback time. This is potentially useful because the encoded audio may have a variable bit rate or because there may be no packets when there is silence (so-called "DTX" or "discontinuous transmission"). In either case, the packet stream index can facilitate fast seeking during a playback process, at least in part because there may often be a non-linear relationship between time and byte offset within the playback stream.
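For illustration only, such a packet stream index can be sketched as a sorted mapping from playback time to byte offset; the 10-second granularity, the entry layout and the helper name below are assumptions rather than the index format of the disclosed implementations.

```python
import bisect
from typing import List, Tuple

# Hypothetical index entries: (playback time in seconds, byte offset into the
# encoded playback stream). Entries need not be evenly spaced in bytes because
# of variable bit rate and DTX (no packets during silence).
IndexEntry = Tuple[float, int]

def seek_offset(index: List[IndexEntry], target_time_s: float) -> int:
    """Return the byte offset at or just before target_time_s."""
    times = [t for t, _ in index]
    pos = bisect.bisect_right(times, target_time_s) - 1
    return index[max(pos, 0)][1]

# Example index built at roughly 10-second granularity (invented numbers).
index = [(0.0, 0), (10.0, 41_200), (20.0, 41_950), (30.0, 120_330)]
print(seek_offset(index, 23.5))  # -> 41950
```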

[00125] In the example shown in Figure 4, the decoding module 404 also receives an ordered playback stream 401B from the packet stream normalization module 402. In this implementation, the decoding module 404 decodes the encoded ordered playback stream 401B and provides the automatic speech recognition module 405, the visualization analysis module 406 and the speaker diarization module 407 with a decoded playback stream. In some examples, the decoded playback stream may be a pulse code modulation (PCM) stream.

[00126] According to some implementations, the decoding module 404 and/or the playback system 609 may apply a different decoding process from the decoding process used during the original teleconference. Due to time, computational and/or bandwidth constraints, the same packet of audio may be decoded in low fidelity with minimal computational requirements during the teleconference, but decoded in higher fidelity with higher computational requirements by the decoding module 404. Higher-fidelity decoding by the decoding module 404 may, for example, involve decoding to a higher sample rate, switching on spectral bandwidth replication (SBR) for better perceptual results, running more iterations of an iterative decoding process, etc.

[00127] In the example shown in Figure 4, the automatic speech recognition module 405 analyzes audio data in the decoded playback stream provided by the decoding module 404 to determine spoken words in the teleconference portion corresponding to the decoded playback stream. The automatic speech recognition module 405 outputs speech recognition results 401F to the joint analysis module 306.

[00128] In this example, the visualization analysis module 406 analyzes audio data in the decoded playback stream to determine the occurrences of talkspurts, the amplitude of the talkspurts and/or the frequency content of the talkspurts, etc., and outputs visualization data 401D. The visualization data 401D may, for example, provide information regarding waveforms that the playback system 609 may display when the teleconference is played back.

[00129] In this implementation, the speaker diarization module 407 analyzes audio data in the decoded playback stream to identify and record occurrences of speech from one or more conference participants, depending on whether a single conference participant or multiple conference participants were using the same telephone endpoint that corresponds to the input uplink packet trace 201B. The speaker diarization module 407 outputs a speaker diary 401E which, along with the visualization data 401D, is included as part of the uplink analysis results 301C output by the analysis engine 307 (see Figure 3C). In essence, the speaker diary 401E indicates which conference participant(s) spoke and when the conference participant(s) spoke.

[00130] The uplink analysis results 301C, together with the speech recognition results 401F, are included in the uplink analysis results available for joint analysis 401 provided to the joint analysis module 306. Each of a plurality of uplink analysis modules may output an instance of the uplink analysis results available for joint analysis to the joint analysis module 306.

[00131] Figure 5 shows examples of components of a joint analysis module. The joint analysis module 306 and its components may be implemented via hardware, via software stored on non-transitory media, via firmware and/or by combinations thereof. The types and numbers of components shown in Figure 5 are merely shown by way of example. Alternative implementations may include more, fewer and/or different components.

[00132] In this example, each of the uplink analysis modules 301-305 shown in Figure 3C has output a corresponding one of the uplink analysis results available for joint analysis 401-405, all of which are shown in Figure 5 as being received by the joint analysis module 306. In this implementation, the speech recognition results 401F-405F, one of which is from each of the uplink analysis results available for joint analysis 401-405, are provided to the keyword spotting and indexing module 505 and to the topic analysis module 525. In this example, the speech recognition results 401F-405F correspond to all conference participants of a particular teleconference. The speech recognition results 401F-405F may, for example, be text files.

[00133] In this example, the keyword spotting and indexing module 505 is capable of analyzing the speech recognition results 401F-405F, of identifying frequently-occurring words that were spoken by all conference participants during the teleconference and of indexing occurrences of the frequently-occurring words. In some implementations, the keyword spotting and indexing module 505 may determine and record the number of instances of each keyword. In this example, the keyword spotting and indexing module 505 outputs the search index 310.
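The following is a hedged sketch of the kind of counting and indexing the keyword spotting and indexing module 505 might perform over speech recognition results; the plain-text input format, the frequency threshold and the function name are assumptions introduced for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

def build_keyword_index(transcripts: Dict[str, List[Tuple[float, str]]],
                        min_count: int = 3) -> Dict[str, List[Tuple[str, float]]]:
    """Index frequently-occurring words across all uplink transcripts.

    transcripts maps an uplink identifier (e.g. "401F") to a list of
    (time in seconds, recognized word) pairs. Words occurring at least
    min_count times across all participants are indexed by location.
    """
    counts = Counter(word.lower()
                     for words in transcripts.values()
                     for _, word in words)
    index: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for uplink, words in transcripts.items():
        for time_s, word in words:
            if counts[word.lower()] >= min_count:
                index[word.lower()].append((uplink, time_s))
    return dict(index)
```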

[00134] In the example shown in Figure 5, the conversational dynamics analysis module 510 receives the speaker diaries 401E-405E, one of which is from each of the uplink analysis results available for joint analysis 401-405. The conversational dynamics analysis module 510 may be capable of determining conversational dynamics data, such as data indicating the frequency and duration of conference participant speech, data indicating instances of conference participant "doubletalk" during which at least two conference participants are speaking simultaneously, data indicating instances of conference participant conversations and/or data indicating instances of one conference participant interrupting one or more other conference participants, etc.
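As one hedged illustration of conversational dynamics analysis, the sketch below finds instances of doubletalk by intersecting the talk intervals of two speaker diaries; the interval representation and function name are assumptions, not the method of the conversational dynamics analysis module 510.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start time, end time) in seconds

def doubletalk(diary_a: List[Interval], diary_b: List[Interval]) -> List[Interval]:
    """Return intervals during which both participants are speaking."""
    overlaps = []
    for a_start, a_end in diary_a:
        for b_start, b_end in diary_b:
            start, end = max(a_start, b_start), min(a_end, b_end)
            if start < end:
                overlaps.append((start, end))
    return overlaps

# Example: participant A and participant B overlap between 12.0 s and 13.5 s.
print(doubletalk([(10.0, 13.5)], [(12.0, 20.0)]))  # -> [(12.0, 13.5)]
```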

[00135] In this example, the conversational dynamics analysis module 510 outputs conversational dynamics data files 515a-515d, each of which corresponds to a different timescale. For example, the conversational dynamics data file 515a may correspond to a timescale wherein segments of the conference (presentation, discussion, etc.) are approximately 1 minute long, the conversational dynamics data file 515b may correspond to a timescale wherein segments of the conference are approximately 3 minutes long, the conversational dynamics data file 515c may correspond to a timescale wherein segments of the conference are approximately 5 minutes long, and the conversational dynamics data file 515d may correspond to a timescale wherein segments of the conference are approximately 7 minutes long or longer. In other implementations, the conversational dynamics analysis module 510 may output more or fewer of the conversational dynamics data files 515. In this example, the conversational dynamics data files 515a-515d are output only to the topic analysis module 525, but in other implementations the conversational dynamics data files 515a-515d may be output to one or more other modules and/or output from the entire analysis engine 307. Accordingly, in some implementations the conversational dynamics data files 515a-515d may be made available to the playback system 609.

[00136] In some implementations, the topic analysis module 525 may be capable of analyzing the speech recognition results 401F-405F and of identifying potential conference topics. In some examples, as here, the topic analysis module 525 may receive and process the conference metadata 210. Various implementations of the topic analysis module 525 are described in detail below. In this example, the topic analysis module 525 outputs the segment and word cloud data 309, which may include topic information for each of a plurality of conversation segments and/or topic information for each of a plurality of time intervals.

[00137] In the example shown in Figure 5, the joint analysis module includes an overview module 520. In this implementation, the overview module 520 receives the conference metadata 210 as well as data from the conference database 308. The conference metadata 210 may include data regarding individual conference participants, such as conference participant name and conference participant location, data indicating the time and date of a conference, etc. The conference metadata 210 may indicate associations between individual conference participants and telephone endpoints. For example, the conference metadata 210 may indicate associations between individual conference participants and one of the analysis results 301C-305C output by the analysis engine (see Figure 3C). The conference database 308 may provide data to the overview module 520 regarding which conferences were scheduled, regarding meeting topics and/or regarding who was invited to each conference, etc. In this example, the overview module 520 outputs the meeting overview information 311, which may include a summary of the conference metadata 210 and of the data from the conference database 308.

[00138] In some implementations, the analysis engine 307 and/or other components of the teleconferencing system 100 may be capable of other functionality. For example, in some implementations the analysis engine 307, the playback system 609 or another component of the teleconferencing system 100 may be capable of assigning virtual conference participant positions in a virtual acoustic space based, at least in part, on conversational dynamics data. In some examples, the conversational dynamics data may be based on an entire conference.

[00139] Figure 6 shows examples of components of a playback system and associated equipment. The playback system 609 and its components may be implemented via hardware, via software stored on non-transitory media, via firmware and/or by combinations thereof. The types and numbers of components shown in Figure 6 are merely shown by way of example. Alternative implementations may include more, fewer and/or different components.

[00140] In this example, the playback system 609 is receiving data corresponding to a teleconference that included three telephone endpoints, instead of a teleconference that included five telephone endpoints as described above. Accordingly, the playback system 609 is shown receiving analysis results 301C-303C, as well as the segment and word cloud data 309, the search index 310 and the meeting overview information 311.

[00141] In this implementation, the playback system 609 includes a plurality of decoding units 601A-603A. Here, decoding units 601A-603A are receiving ordered playback streams 401B-403B, one from each of the analysis results 301C-303C. In some examples, the playback system 609 may invoke one decoding unit per playback stream, so the number of decoding units may change depending on the number of playback streams received.

[00142] According to some implementations, the decoding units 601A-603A may apply a different decoding process from the decoding process used during the original teleconference. As noted elsewhere herein, during the original teleconference audio data may be decoded in low fidelity with minimal computational requirements, due to time, computational and/or bandwidth constraints. However, the ordered playback streams 401B-403B may be decoded in higher fidelity, potentially with higher computational requirements, by the decoding units 601A-603A. Higher-fidelity decoding by the decoding units 601A-603A may, for example, involve decoding to a higher sample rate, switching on spectral bandwidth replication (SBR) for better perceptual results, running more iterations of an iterative decoding process, etc.

[00143] In this example, a decoded playback stream is provided by each of the decoding units 601A-603A to a corresponding one of the post-processing modules 601B-603B. As discussed in more detail below, in some implementations the post-processing modules 601B-603B may be capable of one or more types of processing to speed up the playback of the ordered playback streams 401B-403B. In some such examples, the post-processing modules 601B-603B may be capable of removing silent portions from the ordered playback streams 401B-403B, overlapping portions of the ordered playback streams 401B-403B that were not previously overlapping, changing the amount of overlap of previously overlapping portions of the ordered playback streams 401B-403B and/or other processing to speed up the playback of the ordered playback streams 401B-403B.

[00144] In this implementation, a mixing and rendering module 604 receives output from the post-processing modules 601B-603B. Here, the mixing and rendering module 604 is capable of mixing the individual playback streams received from the post-processing modules 601B-603B and rendering the resulting playback audio data for reproduction by a speaker system, such as the headphones 607 and/or the speaker array 608. In some examples, the mixing and rendering module 604 may provide the playback audio data directly to a speaker system, whereas in other implementations the mixing and rendering module 604 may provide the playback audio data to another device, such as the display device 610, which may be capable of communication with the speaker system. In some implementations, the mixing and rendering module 604 may be capable of rendering the mixed audio data according to spatial information determined by the analysis engine 307. For example, the mixing and rendering module 604 may be capable of rendering the mixed audio data for each conference participant to an assigned virtual conference participant position in a virtual acoustic space based on such spatial information. In some alternative implementations, the mixing and rendering module 604 also may be capable of determining such spatial information. In some instances, the mixing and rendering module 604 may render teleconference audio data according to different spatial parameters than were used for rendering during the original teleconference.

[00145] In some implementations, some functionality of the playback system 609 may be provided, at least in part, according to "cloud-based" systems. For example, in some implementations the playback system 609 may be capable of communicating with one or more other devices, such as one or more servers, via a network. In the example shown in Figure 6, the playback system 609 is shown communicating with an optional playback control server 650 and an optional rendering server 660, via one or more network interfaces (not shown). According to some such implementations, at least some of the functionality that could, in other implementations, be performed by the mixing and rendering module 604 may be performed by the rendering server 660. Similarly, in some implementations at least some of the functionality that could, in other implementations, be performed by the playback control module 605 may be performed by the playback control server 650. In some implementations, the functionality of the decoding units 601A-603A and/or the post-processing modules 601B-603B may be performed by one or more servers. According to some examples, the functionality of the entire playback system 609 may be implemented by one or more servers. The results may be provided to a client device, such as the display device 610, for playback.

[00146] In this example, a playback control module 605 is receiving the playback stream indices 401A-403A, one from each of the analysis results 301C-303C. Although not shown in Figure 6, the playback control module 605 also may receive other information from the analysis results 301C-303C, as well as the segment and word cloud data 309, the search index 310 and the meeting overview information 311. The playback control module 605 may be capable of controlling a playback process (including reproduction of audio data from the mixing and rendering module 604) based, at least in part, on user input (which may be received via the display device 610 in this example), on the analysis results 301C-303C, on the segment and word cloud data 309, on the search index 310 and/or on the meeting overview information 311.

[00147] In this example, the display device 610 is shown providing a graphical user interface 606, which may be used for interacting with playback control module 605 to control playback of audio data. The display device 610 may, for example, be a laptop computer, a tablet computer, a smart phone or another type of device. In some implementations, a user may be able to interact with the graphical user interface 606 via a user interface system of the display device 610, e.g., by touching an overlying touch screen, via interaction with an associated keyboard and/or mouse, by voice command via a microphone and associated software of the display device 610, etc.

[00148] In the example shown in Figure 6, each row 615 of the graphical user interface 606 corresponds to a particular conference participant. In this implementation, the graphical user interface 606 indicates conference participant information 620, which may include a conference participant name, conference participant location, conference participant photograph, etc. In this example, waveforms 625, corresponding to instances of the speech of each conference participant, are also shown in the graphical user interface 606. The display device 610 may, for example, display the waveforms 625 according to instructions from the playback control module 605. Such instructions may, for example, be based on visualization data 401D-403D that is included in the analysis results 301C-303C. In some examples, a user may be able to change the scale of the graphical user interface 606, according to a desired time interval of the conference to be represented. For example, a user may be able to "zoom in" or enlarge at least a portion of the graphical user interface 606 to show a smaller time interval or "zoom out" at least a portion of the graphical user interface 606 to show a larger time interval. According to some such examples, the playback control module 605 may access a different instance of the conversational dynamics data files 515, corresponding with the changed time interval.

[00149] In some implementations a user may be able to control the reproduction of audio data not only according to typical commands such as pause, play, etc., but also according to additional capabilities based on a richer set of associated data and metadata. For example, in some implementations a user may be able to select for playback only the speech of a selected conference participant. In some examples, a user may be able to select for playback only those portions of a conference in which a particular keyword and/or a particular topic is being discussed.

[00150] In some implementations the graphical user interface 606 may display one or more word clouds based, at least in part, on the segment and word cloud data 309. In some implementations the displayed word clouds may be based, at least in part, on user input and/or on a particular portion of the conference that is being played back at a particular time. Various examples are disclosed herein.

[00151] Although various examples of audio data processing have been described above primarily in the teleconferencing context, the present disclosure is more broadly applicable to other known audio data processing contexts, such as processing audio data corresponding to in-person conferences. Such in-person conferences may, for example, include academic and/or professional conferences, doctor/client visits, personal diarization (e.g., via a portable recording device such as a wearable recording device), etc.

[00152] Figure 7 shows an example of an in-person conference implementation. The types and numbers of components shown in Figure 7 are merely shown by way of example. Alternative implementations may include more, fewer and/or different components. In this example, a conference location 700 includes a conference participant table 705 and a listener seating area 710. In this implementation, microphones 715a-715d are positioned on the conference participant table 705. Accordingly, the conference participant table 705 is set up such that each of four conference participants will have his or her separate microphone.

[00153] In this implementation, each of the cables 712a-712d conveys an individual stream of audio data from a corresponding one of the microphones 715a-715d to a recording device 720, which is located under the conference participant table 705 in this instance. In alternative examples, the microphones 715a-715d may communicate with the recording device 720 via wireless interfaces, such that the cables 712a-712d are not required. Some implementations of the conference location 700 may include additional microphones 715, which may or may not be wireless microphones, for use in the listener seating area 710 and/or in the area between the listener seating area 710 and the conference participant table 705.

[00154] In this example, the recording device 720 does not mix the individual streams of audio data, but instead records each individual stream of audio data separately. In some implementations, either the recording device 720 or each of the microphones 715a-715d may include an analog-to-digital converter, such that the streams of audio data from the microphones 715a-715d may be recorded by the recording device 720 as individual streams of digital audio data.

[00155] The microphones 715a-715d may sometimes be referred to as examples of "endpoints," because they are analogous to the telephone endpoints discussed above in the teleconferencing context. Accordingly, the implementation shown in Figure 7 provides another example in which the audio data for each of multiple endpoints, represented by the microphones 715a-715d in this example, will be recorded separately.

[00156] In alternative implementations, the conference participant table 705 may include a microphone array, such as a soundfield microphone. The microphone array may, for example, be a soundfield microphone capable of producing Ambisonic signals in A-format or B-format (such as the Core Sound TetraMic™), a Zoom H4n™, an MH Acoustics Eigenmike™, or a spatial speakerphone such as a Dolby Conference Phone™. The microphone array may be referred to herein as a single endpoint. However, audio data from such a single endpoint may correspond to multiple conference participants. In some implementations, the microphone array may be capable of detecting spatial information for each conference participant and of including the spatial information for each conference participant in the audio data provided to the recording device 720.

[00157] In view of the foregoing, the present disclosure encompasses various implementations in which audio data for a conference involving a plurality of conference participants may be recorded. In some implementations, the conference may be a teleconference, whereas in other implementations the conference may be an in-person conference. In various examples, the audio data for each of multiple endpoints may be recorded separately. Alternatively, or additionally, recorded audio data from a single endpoint may correspond to multiple conference participants and may include spatial information for each conference participant.

[00158] Various disclosed implementations involve processing and/or playback of data recorded in either or both of the foregoing manners. Some such implementations involve determining a virtual conference participant position for each of the conference participants in a virtual acoustic space. Positions within the virtual acoustic space may be determined relative to a virtual listener's head. In some examples, the virtual conference participant positions may be determined, at least in part, according to the psychophysics of human sound localization, according to spatial parameters that affect speech intelligibility and/or according to empirical data that reveals what talker locations listeners have found to be relatively more or less objectionable, given the conversational dynamics of a conference.

[00159] In some implementations, audio data corresponding to an entire conference, or at least a substantial portion of a teleconference, may be available for determining the virtual conference participant positions. Accordingly, a complete or substantially complete set of conversational dynamics data for the conference may be determined. In some examples, the virtual conference participant positions may be determined, at least in part, according to a complete or substantially complete set of conversational dynamics data for a conference.

[00160] For example, the conversational dynamics data may include data indicating the frequency and duration of conference participant speech. It has been found in listening exercises that many people object to a primary speaker in a conference being rendered to a virtual position behind or beside the listener. When listening to a long section of speech from one talker (e.g., during a business presentation), many listeners report that they would like a sound source corresponding to the talker to be positioned in front of the listener, just as if the listener were present at a lecture or seminar. For long sections of speech from one talker, positioning behind or beside the listener often evokes the comment that it seems unnatural, or, in some cases, that the listener's personal space is being invaded. Accordingly, the frequency and duration of conference participant speech may be useful input to a process of assigning and/or rendering virtual conference participant positions for a playback of an associated conference recording.

[00161] In some implementations, the conversational dynamics data may include data indicating instances of conference participant conversations. It has been found that rendering conference participants engaged in a conversation to substantially different virtual conference participant positions can improve a listener's ability to distinguish which conference participant is talking at any given time and can improve the listener's ability to understand what each conference participant is saying.

[00162] The conversational dynamics data may include instances of so-called "doubletalk" during which at least two conference participants are speaking simultaneously. It has been found that rendering conference participants engaged in doubletalk to substantially different virtual conference participant positions can provide the listener an advantage, as compared with rendering conference participants engaged in doubletalk to the same virtual position. Such differentiated positioning provides the listener with better cues to selectively attend to one of the conference participants engaged in doubletalk and/or to understand what each conference participant is saying.

[00163] In some implementations, the conversational dynamics data may be applied as one or more variables of a spatial optimization cost function. The cost function may be a function of a vector describing a virtual conference participant position for each of a plurality of conference participants in a virtual acoustic space.
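Purely as a hedged illustration of how conversational dynamics data could enter such a cost function, one might write a cost over the position vector in which a participant's total talk time pulls dominant talkers toward frontal positions and pairs with frequent doubletalk are pushed apart. Every symbol, term and weight below is an assumption introduced here for illustration, not the cost function of this disclosure:

```latex
% x = (x_1, ..., x_N): virtual positions of the N conference participants.
% T_n: total talk time of participant n; D: set of participant pairs with
% significant doubletalk; d_front: distance of a position from the frontal
% region; w_front, w_dt: illustrative weights.
C(\mathbf{x}) = \sum_{n=1}^{N} w_{\mathrm{front}}\, T_n\, d_{\mathrm{front}}(x_n)
  \;+\; \sum_{(m,n) \in \mathcal{D}} \frac{w_{\mathrm{dt}}}{\lVert x_m - x_n \rVert}
```

Under this sketch, an optimizer would choose the position vector that minimizes C(x).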

[00164] As noted above, accurate meeting transcription based on automatic speech recognition has proven to be a challenging task. In particular, the word error rate (WER) for meeting speech (conference speech involving multiple conference participants) has remained substantially higher than the WER for other types of speech. One reason for this persistently high WER is that people speak very differently to their colleagues or other conference participants than they do when they are consciously dictating to a machine. In real meetings, people tend to speak quickly, may use jargon that is specific to a certain group of people and/or particular field of endeavor, may discuss topics that were covered in a prior meeting that is not available for speech recognition, may use "filler" sounds such as "um" and "ah," may not finish their sentences, may interrupt each other, etc. Other challenges include the diversity of acoustic scenes that may be involved in conferences, because the audio data may be recorded by various devices (such as headsets, laptops, or spatial capture microphone systems). The participants of a teleconference may be located in rooms having a variety of acoustical properties. Conference participants may be using one or more different dialects of a language.

[00165] Despite the known high WER for meeting speech, prior attempts to generate meeting topics automatically were typically based on the assumption that the results of an ASR process performed on conference recordings produced a perfect transcript of words spoken by conference participants. (The results of an ASR process also may be referred to herein as "ASR results" or "speech recognition results data.")

[00166] This disclosure includes various novel techniques for searching speech recognition results data. As noted elsewhere herein, such searching processes may involve keyword spotting. The performance of keyword spotting may, for example, be evaluated in terms of miss rate (P(miss)) in trade-off with a false positive or "false alarm" rate (P(false-alarm)). The negative effects of false alarms can be diminished by a user's manual review. Therefore, many users may prefer to discover as much content of interest as possible, even if this means contending with a large number of false positive results.

[00167] Accordingly, some implementations disclosed herein may involve a fuzzy term selection process for determining an expanded search query from an initial search query provided by a user. The fuzzy term selection process may involve determining an expanded search query according to phonetic similarity, according to semantic similarity, or according to both phonetic similarity and semantic similarity. Performing a search using such an expanded search query may reduce the miss rate. Particularly in view of the high WER for conference speech, using such an expanded search query has the potential advantage of reducing the miss rate by including more phonetically similar search results.

[00168] Figure 8 is a flow diagram that outlines one example of a method according to some implementations of this disclosure. In some examples, the method 800 may be performed by an apparatus, such as the apparatus of Figure 3A. The blocks of method 800, like other methods described herein, are not necessarily performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described.

[00169] In this implementation, block 805 involves receiving speech recognition results data for at least a portion of an audio recording. According to some examples, the speech recognition results data may correspond to a recording of a complete or a substantially complete conference. According to some implementations, the conference may be a conference involving a plurality of conference participants, such as a teleconference, whereas in other implementations the conference may correspond to a monologue, such as a recording of a monologue for personal diarization.

[00170] In some examples, the speech recognition results data may include a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices, the word recognition confidence score corresponding with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference. In some implementations, a control system, such as the control system 330 of Figure 3A, may receive the speech recognition results data via the interface system 325 in block 805. The control system may, in some such implementations, be capable of performing the operations of blocks 810-825.

[00171] In this example, block 810 involves receiving an initial search query including at least one search word. The initial search query may, for example, be received via a user interface, such as a graphical user interface. For instance, the initial search query may be received as text typed into an area, such as a search window, of a graphical user interface. In other examples, the initial search query may be received as text produced by speech recognition software. For example, the speech recognition software may receive as input speech captured from a user via one or more microphones, which may be part of an interface system 325. In some examples, the initial search query may include a single word, multiple words, one or more phrases, one or more word trees (e.g., part of a lexicon), one or more word graphs (such as word lattices), one or more "wildcard" phrases, etc. Such wildcard phrases may, for example, include a single character, such as an asterisk, that can represent any ending of a word. For example, the search word "work*" may correspond not only with the word "work," but also with "works," "worker," "workers," "working," etc.
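As a hedged sketch only, a trailing-asterisk wildcard such as "work*" could be expanded against a vocabulary by translating it to a regular expression; the regex mechanism and the vocabulary below are illustrative assumptions, not necessarily how the disclosed implementations handle wildcard phrases.

```python
import re
from typing import List

def expand_wildcard(term: str, vocabulary: List[str]) -> List[str]:
    """Expand a trailing-asterisk wildcard against a vocabulary."""
    pattern = re.compile(re.escape(term).replace(r"\*", r"\w*") + r"$", re.IGNORECASE)
    return [w for w in vocabulary if pattern.match(w)]

vocab = ["work", "works", "worker", "workers", "working", "network"]
print(expand_wildcard("work*", vocab))
# -> ['work', 'works', 'worker', 'workers', 'working']
```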

[00172] According to this implementation, block 815 involves analyzing the initial search query according to phonetic similarity and semantic similarity. In some examples, analyzing the initial search query may involve analyzing syllables and phonemes of the initial search query. Analyzing the initial search query according to phonetic similarity may, for example, involve applying a phonetic confusion model. In some such examples, the phonetic confusion model may be based, at least in part, on a phoneme and syllable cost matrix and a sub-word unit to word unit transducer. The transducer may refer to a lexicon to determine whether a hypothesized string of phonemes constitutes a valid word. In some examples, the transducer may be a weighted finite-state transducer (WFST). In some examples, analyzing the initial search query according to semantic similarity may involve applying a semantic confusion model that is based, at least in part, on a conference vocabulary, a lexical database for a language or both the conference vocabulary and the lexical database. Various examples are described in detail below.

[00173] In this example, block 820 involves determining an expanded search query according to the phonetic similarity, according to the semantic similarity, or according to both the phonetic similarity and the semantic similarity. According to some examples, block 820 may involve a selecting process of determining candidate search query terms and of selecting candidate search query terms to produce a refined search term list. In some such examples, the selecting process may involve selecting candidate search query terms according to one or more of the following: a limit of sub-word unit insertions; a limit of sub-word unit deletions; a limit of sub-word unit substitutions; exclusion of words not found in a conference vocabulary; a phonetic similarity cost threshold; a semantic similarity cost threshold; and user input.
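Under assumptions about the data layout, such a selecting process can be sketched as filtering a candidate list by edit-operation limits and cost thresholds; the candidate representation, limits and thresholds below are illustrative placeholders.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Candidate:
    """One candidate search term (hypothetical representation)."""
    word: str
    insertions: int        # sub-word unit insertions relative to the query term
    deletions: int         # sub-word unit deletions
    substitutions: int     # sub-word unit substitutions
    phonetic_cost: float   # lower means more phonetically similar
    semantic_cost: float   # lower means more semantically similar
    in_vocabulary: bool    # whether the word appears in the conference vocabulary

def select_candidates(candidates: List[Candidate],
                      max_ins: int = 1, max_del: int = 1, max_sub: int = 2,
                      phonetic_threshold: float = 0.5,
                      semantic_threshold: float = 0.5) -> List[str]:
    """Produce a refined search term list from candidate query terms."""
    return [c.word for c in candidates
            if c.in_vocabulary
            and c.insertions <= max_ins
            and c.deletions <= max_del
            and c.substitutions <= max_sub
            and (c.phonetic_cost <= phonetic_threshold
                 or c.semantic_cost <= semantic_threshold)]
```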

[00174] According to some implementations, the selecting process may be performed, at least in part, according to user input. In some such implementations, the user input may include one or more of the following: a vocabulary size; an indication of an importance of phonetic similarity; an indication of an importance of semantic similarity; and an indication of a relative importance of reducing a miss rate and reducing a false positive rate. Some examples are described below.

[00175] In this implementation, block 825 involves performing a search of the speech recognition results data according to the expanded search query. In some examples, the search may involve a word unit search and a sub-word unit search. Some such examples may involve determining a word unit search score and a sub-word unit search score, and combining the word unit search score and the sub-word unit search score to determine a total search score.
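One simple, hedged way to combine the two scores mentioned above is a weighted sum; the linear interpolation and the weight value are assumptions, not the fusion rule of the disclosed implementations.

```python
def total_search_score(word_score: float, subword_score: float,
                       word_weight: float = 0.7) -> float:
    """Combine a word unit search score and a sub-word unit search score.

    Linear interpolation is only one possible fusion rule; the weight would
    typically be tuned on held-out conference recordings.
    """
    return word_weight * word_score + (1.0 - word_weight) * subword_score

print(total_search_score(0.9, 0.4))  # -> 0.75
```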

[00176] Some implementations may involve returning search results corresponding to the search. Some such implementations may involve selecting, from the audio recording, playback audio data comprising one or more instances of speech that include the search results. Such examples also may involve providing the playback audio data for playback on a speaker system, such as the headphones 607 or the speaker array 608 that are shown in Figure 6. Some implementations may involve scheduling at least a portion of the instances of speech for simultaneous playback.

[00177] The audio recording may, in some examples, include at least a portion of a recording of a conference involving a plurality of conference participants. Some such examples may involve receiving an indication of a selected conference participant chosen by a user from among the plurality of conference participants. In some such examples, the indication may be received from a graphical user interface such as disclosed herein, e.g., from the graphical user interface 606 that is shown in Figure 6. According to some examples, selecting the playback audio data may involve selecting one or more instances of speech of the conference recording that include speech by the selected conference participant that include the search results.

[00178] Figure 9 is a block diagram that shows examples of modules that may be used to perform some of the methods disclosed herein. In the example shown in Figure 9, a fuzzy term selection module 905 is shown receiving search terms 901. This is one example of receiving an initial search query, as described above with reference to block 810 of Figure 8. According to some implementations, the fuzzy term selection module 905 may be capable of performing, at least in part, the operations described above with reference to blocks 815 and 820 of Figure 8. In some examples, the fuzzy term selection module 905, the core search module 909 and the post core search processing module 910 may be components of the control system 330 of Figure 3A.

[00179] In this example, the fuzzy term selection module 905 is capable of analyzing the search terms 901 of an initial search query and determining an expanded search query 912 according to input from the phonetic confusion model 903 and the semantic confusion model 904. The phonetic confusion model 903 and the semantic confusion model 904 may, for example, have been determined prior to the time at which the search terms 901 are received by the fuzzy term selection module 905.

[00180] The phonetic confusion model 903 may define a similarity interface between search terms and candidate synophones. As used herein, the term "synophone" refers to a word having a pronunciation that is similar to, but not necessarily the same as, a pronunciation of another word. For example, "truck" is a synophone of "tuck," even though these words are not homonyms. Accordingly, a set of synophones for a word may include more members than a set of homonyms for that word. In some examples, the phonetic confusion model 903 may be based, at least in part, on a conference vocabulary. The phonetic confusion model 903 may be based, at least in part, on a phoneme and syllable cost matrix. In some examples, the phonetic confusion model 903 may be based, at least in part, on a sub-word unit to word unit transducer.
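To make the synophone notion concrete, the sketch below scores the phonetic closeness of "truck" and "tuck" with a simple phoneme-level edit distance over CMU-style phoneme strings; the uniform unit costs are merely a stand-in for the phoneme and syllable cost matrix described elsewhere herein.

```python
from typing import List

def phoneme_edit_distance(ref: List[str], hyp: List[str],
                          ins_cost: float = 1.0, del_cost: float = 1.0,
                          sub_cost: float = 1.0) -> float:
    """Dynamic-programming edit distance over phoneme sequences."""
    n, m = len(ref), len(hyp)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = 0.0 if ref[i - 1] == hyp[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,
                          d[i][j - 1] + ins_cost,
                          d[i - 1][j - 1] + same)
    return d[n][m]

# "truck" = T R AH K, "tuck" = T AH K (CMU-style phonemes, stress omitted).
print(phoneme_edit_distance(["T", "R", "AH", "K"], ["T", "AH", "K"]))  # -> 1.0
```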

[00181] The semantic confusion model 904 may define a similarity interface between search terms and candidate synonyms. The semantic confusion model 904 may be based, at least in part, on a conference vocabulary, based on a lexical database for a language or based on both the conference vocabulary and the lexical database. In some examples, the lexical database may include WordNet®.
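As a hedged illustration of candidate synonym generation, the sketch below uses the WordNet® interface bundled with NLTK; whether a given implementation accesses the lexical database this way is an assumption.

```python
# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

def candidate_synonyms(word: str) -> set:
    """Collect lemma names from all WordNet synsets of a word."""
    synonyms = set()
    for synset in wn.synsets(word):
        for lemma in synset.lemmas():
            synonyms.add(lemma.name().replace("_", " ").lower())
    synonyms.discard(word.lower())
    return synonyms

print(candidate_synonyms("meeting"))
# e.g. a set containing 'merging', 'group meeting', 'encounter', ...
```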

[00182] According to this example, the fuzzy term selection module 905 is capable of providing the expanded search query 912 to the core search module 909. In this implementation, the core search module 909 is capable of performing a search, such as described above with reference to block 825 of Figure 8, according to the expanded search query 912.

[00183] In the example shown in Figure 9, the core search module 909 is capable of performing the search with reference to the keyword index database 908. In this example, the keyword index database 908 is generated before the time that the expanded search query 912 is received by the core search module 909. According to some such examples, the conference lattice generator 906 and the keyword indexing module 907 may perform their operations offline, e.g., on a regularly scheduled basis, after conferences are concluded and the corresponding data are stored. In this example, the conference lattice generator 906 receives audio recordings of conferences and outputs utterance lattices 915 to the keyword indexing module 907. According to some examples, the conference lattice generator 906 may include a WFST (weighted finite-state transducer)-based speech recognizer. The keyword indexing module 907 is capable of populating the keyword index database 908 according to data from the lattices 915.

[00184] According to some examples, the keyword indexing module 907 may be an instance of the keyword spotting and indexing module 505 that is described above with reference to Figure 5. In such examples, the keyword index database 908 may be an instance of a search index 310.

[00185] In the example shown in Figure 9, the post core search processing module 910 is capable of performing post-processing operations on core search results 920 that are received from the core search module 909. In some examples, the core search results 920 may include results from different sources, for a single search query. Each of the results may have a different score or rank. In some examples, the post core search processing module 910 may be capable of performing a score fusion and merging process, such as that described below.

[00186] Figure 10 is a block diagram that shows examples of inputs for a phonetic confusion model and a semantic confusion model. As noted elsewhere herein, the phonetic confusion model 903 and the semantic confusion model 904 may be prepared offline, prior to receiving search terms 901 for a particular search. Pre-calculating these models can allow the construction of sorted phonetic and semantic similarity candidate search term lists, which may be used in a subsequent searching process.

[00187] In this example, the conference dictionary 1007 includes a pronunciation dictionary 1005 and a semantic dictionary 1006. Here, the pronunciation dictionary 1005 includes word pronunciation input from a general lexicon 1001 and from the conference vocabulary 1002. The general lexicon 1001 may, for example, be a dictionary for a particular language, such as the Carnegie Mellon University (CMU) dictionary, the BEEP dictionary, the Oxford English Dictionary, the Merriam-Webster Dictionary, etc. The conference vocabulary 1002 is a word set that may define, at least in part, the scope for generating candidate search terms. In some examples, the conference vocabulary 1002 may be based, at least in part, on words that have been uttered in prior conferences, such as prior conferences regarding a similar topic. In some examples, the pronunciation dictionary 1005 may be generated by selecting words from the general lexicon 1001 that are also included in the conference vocabulary 1002.

[00188] According to this implementation, the semantic dictionary 1006 is based, at least in part, on the lexical database 1003. The lexical database 1003 may be a lexical resource that has an associated software environment database that permits access to the contents of the lexical database 1003. The lexical database 1003 may include lexical categories and synonyms of words, as well as semantic relations between different words or sets of words. The lexical categories may vary, depending on the underlying language. The lexical categories for the English language may, for example, include nouns, verbs, adjectives, adverbs, pronouns, prepositions, conjunctions, interjections, etc. In some examples, the lexical database 1003 may include WordNet®.

[00189] In this implementation, the semantic dictionary 1006 is based on semantic information from the conference vocabulary 1002 and from the lexical database 1003. For example, the semantic dictionary 1006 may be generated by selecting words from the lexical database 1003 that are also included in the conference vocabulary 1002. According to this example, the semantic confusion model 904 is based on the semantic dictionary 1006.

[00190] In this example, the phonetic confusion model 903 is based on input from the phoneme and syllable cost matrix 1004 and on input from the pronunciation dictionary 1005. The phoneme and syllable cost matrix 1004 may indicate the phonetic similarity between sub-word units, such as phonemes and syllables. As used herein, a "sub-word unit" may be a syllable or a phoneme. The phoneme and syllable cost matrix 1004 may, in some examples, include a global cost matrix for defining the cost of converting a sub-word unit to another sub-word unit. Accordingly, the phoneme and syllable cost matrix 1004 may indicate the cost of converting a phoneme to another phoneme and the cost of converting a syllable to another syllable. According to some implementations, the phoneme and syllable cost matrix 1004 may be generated by a WFST.

[00191] Figure 11 provides an example of determining costs for a phoneme and syllable cost matrix. Although this specific example is for determining phoneme-to-phoneme (P2P) costs, a similar process may be used to determine other costs, such as syllable-to-syllable (S2S) costs. The process of Figure 11 may be viewed as a training process.

[00192] This example involves collecting P2P costs with the aid of a speech recognizer of an ASR program. This ASR program may or may not be the same type of ASR program used in the automatic speech recognition module 405 that is described above with reference to Figure 4. According to this example, based on input from an acoustic model 1101 and on a reference transcription 1102 that has previously been made of the target speech 1103, in block 1104 the speech recognizer determines the starting and ending time for each phoneme of the target speech 1103. Here, a reference phoneme boundary 1108 is determined by the forced alignment of block 1104 between the reference transcription 1102 and the target speech 1103.

[00193] In this example, the target speech 1103 is provided to a phoneme recognizer in block 1105. Here, a hypothesized transcription 1106 is made according to the output of block 1105. According to this example, block 1107 involves another forced alignment. However, no reference transcription is provided as input to block 1107 in this implementation. Instead, a hypothesized phoneme boundary 1109 is determined according to a forced alignment between the hypothesized transcription 1106 and the target speech 1103. Finally, a frame-level histogram between reference and hypothesized phonemes (which may be represented as hist(P_ref, P_hyp)) is calculated in block 1111. In this example, the histogram is a two-dimensional histogram. In some examples, the higher the histogram value, the greater the similarity and the lower the P2P cost between two phonemes.

[00194] Figure 12 shows an example of a small phoneme confusion matrix and a WFST for two phones labeled phn1 and phn2. In the example shown in Figure 12, the insertion and deletion costs, also referred to herein as penalties, have the highest cost of 1.0. Substitution of an identical phoneme corresponds with the lowest cost, which is 0.0 in this example. In this example, the other intermediate substitution costs are calculated from the frame-level histogram of the reference phoneme and the hypothesized phoneme across all frames, normalized by the maximum value in the grid, e.g., according to the following expression:

Cost(P_ref, P_hyp) = 1 − hist(P_ref, P_hyp) / max(hist(P_i, P_j)), 1 ≤ i ≤ N, 1 ≤ j ≤ N

[00195] Assuming that N represents the total number of phoneme types, the cost will be stored in the weight of the transducer in parallel arcs, as shown in the right portion of Figure 12.
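
By way of illustration only, the following Python sketch (not part of the disclosed WFST-based implementation; the phoneme labels and frame alignments are hypothetical) shows how a frame-level confusion histogram such as hist(P_ref, P_hyp) can be turned into the substitution costs defined by the expression above, with identical phonemes assigned the lowest cost of 0.0:

```python
# Minimal sketch: derive phoneme-to-phoneme substitution costs from a frame-level
# confusion histogram, following the expression in paragraph [00194]. All data are
# hypothetical and the function names are illustrative only.
from collections import defaultdict

def build_histogram(ref_frames, hyp_frames):
    """Count, per frame, how often reference phoneme i co-occurs with hypothesized phoneme j."""
    hist = defaultdict(int)
    for ref_phn, hyp_phn in zip(ref_frames, hyp_frames):
        hist[(ref_phn, hyp_phn)] += 1
    return hist

def substitution_costs(hist, phonemes):
    """Cost(P_ref, P_hyp) = 1 - hist(P_ref, P_hyp) / max(hist); identical phonemes cost 0.0."""
    max_count = max(hist.values())
    costs = {}
    for p_ref in phonemes:
        for p_hyp in phonemes:
            if p_ref == p_hyp:
                costs[(p_ref, p_hyp)] = 0.0
            else:
                costs[(p_ref, p_hyp)] = 1.0 - hist.get((p_ref, p_hyp), 0) / max_count
    return costs

# Example with two frame-aligned phoneme strings (invented data).
ref = ["y", "eh", "eh", "s", "s"]
hyp = ["y", "eh", "ae", "s", "s"]
print(substitution_costs(build_histogram(ref, hyp), ["y", "eh", "ae", "s"]))
```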

[00196] For most language systems, the pronunciation dictionary 1005 may be grapheme-based and can be expanded via one or more lexicons. As noted above, in some implementations the pronunciation dictionary 1005 may be generated by selecting words from the general lexicon 1001 that are also included in the conference vocabulary 1002. However, in some examples, the pronunciation dictionary 1005 may be, or may be similar to, the Carnegie Mellon University (CMU) pronouncing dictionary. According to some examples, a phoneme-to-word (P2W) confusion matrix and a syllable-to-word (S2W) confusion matrix may be converted from such a lexicon. Depending on the expected performance in terms of P(miss) and P(false-alarm), dictionary-derived lexicons of different sizes may be selected to generate a transducer.

[00197] Figure 13 shows an example of phoneme-to-word generation from a conference pronunciation dictionary. The simple examples shown in Figure 13 correspond to the words "yes," "no" and "yeah." In this example, the input label of each arc represents a phoneme or syllable sequence and the corresponding word is shown in the output label. For example, the uppermost arc corresponds with the word "yes." In the example shown in Figure 13, the arcs that go directly from start node 0 to end node 5 correspond with pauses between sentences and the backward epsilon arc from end node 5 to start node 0 corresponds with moving to another word, in order to sequentially process each word in a word string.

[00198] Figure 14 shows an example of a simple synonym-to-word (N2W) process. A process like that shown in Figure 14 may, for example, be part of a process of developing a semantic confusion model. In this example, components of the semantic confusion model are being generated according to input from the lexical database 1003 and the conference dictionary 1007. The lexical database 1003 may include lexical categories and synonyms of words, as well as semantic relations between different words or sets of words. A semantic dictionary 1006 of the conference dictionary 1007 may provide input for the process.

[00199] The process illustrated in Figure 14 may involve grouping lexical categories, such as nouns, verbs, adjectives, and adverbs, into sets of cognitive synonyms, each of which expresses a distinct concept. In this example, the lexical database 1003 is used to determine synonyms for the word "truck." As shown in the right part of Figure 14, in each arc linking the start node zero and the end node 1, the input label is a synonym of the output-labeled word, along with a corresponding semantic "cost." For example, the word "lorry" is semantically similar to the word "truck," so the corresponding cost is low.
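
As a rough illustration only (not the disclosed WFST construction), the synonym-to-word arcs of Figure 14 can be thought of as (input label, output label, cost) triples built from a synonym dictionary. The synonym sets and cost values in the following sketch are hypothetical:

```python
# Minimal sketch: build the synonym-to-word (N2W) arcs of paragraph [00199] as plain
# (input_label, output_label, cost) tuples. Synonym sets and costs are invented.
SYNONYM_SETS = {
    "truck": [("lorry", 0.1), ("pickup", 0.4), ("vehicle", 0.7)],
    "meeting": [("conference", 0.2), ("call", 0.5)],
}

def build_n2w_arcs(synonym_sets):
    """Return one arc per synonym; a low cost indicates close semantic similarity."""
    arcs = []
    for word, synonyms in synonym_sets.items():
        arcs.append((word, word, 0.0))          # a word maps to itself at zero cost
        for synonym, cost in synonyms:
            arcs.append((synonym, word, cost))
    return arcs

for arc in build_n2w_arcs(SYNONYM_SETS):
    print(arc)
```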

[00200] Figure 15 is a block diagram that shows more detailed examples of elements that may be involved in a conference keyword search process. In this example, the fuzzy term selection module 905 includes a phonetic similarity processing unit 1504 and a semantic similarity processing unit 1505. Here, the phonetic similarity processing unit 1504 receives input from the phonetic confusion model 903 and the semantic similarity processing unit 1505 receives input from the semantic confusion model 904.

[00201] According to some examples, the phonetic similarity processing unit 1504 is capable of receiving search terms 901 of an initial search query and of finding phonetically similar candidates (if any exist) in the conference dictionary 1007 (e.g., in the pronunciation dictionary 1005 of the conference dictionary 1007 that is shown in Figure 10), according to the phonetic confusion model 903. If phonetically similar candidates are found, such candidates may be used as the search terms instead of, or in addition to, the initial search query. Such a process has the potential advantage of providing boundaries, corresponding with the sizes of the conference vocabulary 1002 and the pronunciation dictionary 1005, to the scope of the search query expansion. Such processes may have the potential advantage of saving an additional sub-unit index building process and of providing complementary scores that may be compared with scores of a sub-word unit based approach. According to some implementations, the process may constrain the search terms to tokens only appearing in a word lattice, which can improve precision.
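
The following Python sketch, offered only as a simplified, non-WFST illustration of the candidate-finding step just described, scores every word of a (conference-restricted) pronunciation dictionary against the query's phoneme string using a weighted edit distance with insertion and deletion penalties of 1.0; the dictionary entries and substitution costs are hypothetical:

```python
# Minimal sketch of finding phonetically similar candidates (paragraph [00201]).
INS_DEL_COST = 1.0

def weighted_edit_distance(a, b, sub_cost):
    """Dynamic-programming edit distance between two phoneme sequences."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * INS_DEL_COST
    for j in range(1, cols):
        d[0][j] = j * INS_DEL_COST
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + INS_DEL_COST,    # deletion
                d[i][j - 1] + INS_DEL_COST,    # insertion
                d[i - 1][j - 1] + sub_cost.get((a[i - 1], b[j - 1]),
                                               0.0 if a[i - 1] == b[j - 1] else 1.0),
            )
    return d[-1][-1]

def phonetic_candidates(query_phonemes, pron_dict, sub_cost, top_n=3):
    """Return the top_n phonetically closest words from the pronunciation dictionary."""
    scored = [(weighted_edit_distance(query_phonemes, phns, sub_cost), word)
              for word, phns in pron_dict.items()]
    return sorted(scored)[:top_n]

pron_dict = {"truck": ["t", "r", "ah", "k"], "tuck": ["t", "ah", "k"],
             "track": ["t", "r", "ae", "k"]}
sub_cost = {("ah", "ae"): 0.3, ("ae", "ah"): 0.3}
print(phonetic_candidates(["t", "r", "ah", "k"], pron_dict, sub_cost))
```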

[00202] According to this example, the keyword index database 908 includes a word unit index 1501, a syllable unit index 1502 and a phoneme unit index 1503. Accordingly, the core search module 909 is capable of performing searches on the word unit index 1501, the syllable unit index 1502 and the phoneme unit index 1503.

[00203] In this implementation, the core search module 909 includes a word unit search module 1506 that is capable of performing searches of the word unit index 1501 based on input from the phonetic similarity processing unit 1504 and based on input from the semantic similarity processing unit 1505. Here, the core search module 909 also includes a sub-word unit search module 1507 that is capable of performing sub-word searches of the syllable unit index 1502 and the phoneme unit index 1503, based on input from the phonetic similarity processing unit 1504.

[00204] In this example, the post core search processing module 910 includes a score fusion and processing module 1508 that is capable of generating confidence-sorted results 1509 as output. Here, the score fusion and processing module 1508 receives output from the word unit search module 1506 and the sub-word unit search module 1507. The output may include scores, which may correspond to costs or confidence scores. Further details regarding the elements shown in Figure 15 are provided below.

[00205] Figure 16 is a flow diagram that shows an example of determining an expanded search query for a word unit search. In this example, the expanded search query is determined according to phonetic similarity and semantic similarity. At the top of Figure 16, which represents the beginning of the process 1600, search terms 901 of an initial search query are received. Phoneme-to-word and syllable-to-word finite-state transducers (FSTs) P2W and S2W are received from the phonetic confusion model 903. A synonym-to-word FST N2W is received from the semantic confusion model 904.

[00206] In this example, a query FST corresponding with the initial search query is composed with the phoneme-to-word FST to obtain a phoneme-to-query (P2Q) FST. The query FST is also composed with the syllable-to-word FST S2W and the synonym-to-word FST N2W, to obtain a syllable-to-query FST S2Q and a synonym-to-query FST Q2FN.

[00207] In this implementation, the query lexicon Q2P is determined as the inverse of P2Q and the query lexicon Q2S is determined as the inverse of S2Q. According to this example, a fuzzy phoneme transducer P2P is received from the phonetic confusion model 903 and is composed with the query lexicon Q2P. The result is pruned to produce a potential alternative pronunciation lexicon Q2FP. Similarly, a fuzzy syllable transducer S2S is received from the phonetic confusion model 903 and is composed with the query lexicon Q2S. The result is pruned to produce a potential alternative pronunciation lexicon Q2FS.

[00208] According to this example, Q2FP is composed with the phoneme-to-word FST (P2W) to obtain Q2Q_FP, which is input to a pruning and determinization process. Similarly, Q2FS is composed with the syllable-to-word FST (S2W) to obtain Q2Q_FS, which is input to a pruning and determinization process. The pruning process may, for example, involve removing WFST branches corresponding to unlikely paths, e.g., according to branch length, weighting factor or probability, etc. According to some examples, the determinization processes may involve building an equivalent transducer that has a unique initial state, such that no two transitions leaving any state share the same input label. One purpose of such a determinization process is to keep a unique input label in every arc. In this implementation, the results of both pruning and determinization processes, along with Q2FN, are input to a process of selecting the best N candidates via tracing back the top N shortest paths in each of the input FSTs. Selecting the shortest path may involve removing unlikely alternatives in the generated candidate search terms. According to some implementations, selecting the shortest path may involve performing a dynamic time-warping algorithm such as the Longest Common Subsequence algorithm, the Edit Distance with Real Penalty algorithm or the Time Warp Edit Distance algorithm, e.g., as described in Marteau, P.F., "Time Warp Edit Distances with Stiffness Adjustment for Time Series Matching" (University of Bretagne Sud, February 2007), which is hereby incorporated by reference. However, other implementations may not involve a time-warping algorithm. Some WFST-based implementations may provide advantages, as compared to implementations that involve a time-warping algorithm. Some such implementations may take advantage of graph processing and simplification, to solve these problems implicitly.

[00209] In this example, the output labels of the FSTs are then projected to the input labels to produce Q_FP, Q_FN and Q_FS, which correspond to fuzzy phoneme, fuzzy synonym and fuzzy syllable candidate search terms. According to this implementation, the candidate search terms of Q_FP, Q_FN and Q_FS are input to an FST interpolation process, to produce the final expanded search query Q_FA. In the field of machine learning, combining the candidate search terms may be regarded as an example of a "mixture of experts." According to some implementations, a simple linear model may be efficiently implemented in word-level compositions between each of the Q_FP, Q_FS and Q_FN transducers and their corresponding weights, according to the following formula:

Q_FA = nshortest(det((Q_FP ∘ W_FP) ∪ (Q_FS ∘ W_FS) ∪ (Q_FN ∘ W_FN)))   (Equation 1)

[00210] In Equation 1, W_FP, W_FN and W_FS represent the weights of Q_FP, Q_FN and Q_FS, respectively, and "det" refers to determinization. In this example, W_FP, W_FN and W_FS represent weight transducers after conversion to a negative log score. Given three original weights λ_FP, λ_FS and λ_FN for the three experts, summing to 1.0, after conversion to negative log scores these weights are added to the weight of every arc of the original Q_FP, Q_FS and Q_FN. Some implementations may involve methods such as those described in the publication by X. Liu, M.J.F. Gales, J.L. Hieronymus and P.C. Woodland entitled "Language Model Combination and Adaptation Using Weighted Finite State Transducers," which is hereby incorporated by reference. According to this example, a shortest-distance operation is then employed to filter out the top-N most likely search terms and their corresponding scores. In the example shown in Figure 16, the result of the foregoing series of operations is provided to the word unit search module 1506.
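
The following Python sketch is a greatly simplified, non-WFST analogue of the interpolation in Equation 1: each expert contributes candidate terms with negative-log costs, the expert weights (summing to 1.0) are converted to negative-log scores and added to every candidate, and the top-N candidates of the union are kept. All candidate terms, costs and weights below are hypothetical:

```python
# Minimal "mixture of experts" sketch corresponding to Equation 1 (paragraph [00210]).
import math

def interpolate(expert_candidates, expert_weights, top_n=5):
    """expert_candidates: {expert: [(term, neg_log_cost), ...]}; expert_weights sum to 1.0."""
    merged = {}
    for expert, candidates in expert_candidates.items():
        weight_cost = -math.log(expert_weights[expert])   # lambda -> negative log score
        for term, cost in candidates:
            total = cost + weight_cost                     # added to every candidate ("arc")
            merged[term] = min(merged.get(term, float("inf")), total)
    # Keep the top_n lowest-cost candidates, an analogue of the n-shortest operation.
    return sorted(merged.items(), key=lambda kv: kv[1])[:top_n]

candidates = {
    "fuzzy_phoneme": [("truck", 0.1), ("tuck", 0.9)],
    "fuzzy_syllable": [("truck", 0.2), ("trucked", 0.8)],
    "fuzzy_synonym": [("lorry", 0.4)],
}
weights = {"fuzzy_phoneme": 0.5, "fuzzy_syllable": 0.3, "fuzzy_synonym": 0.2}
print(interpolate(candidates, weights))
```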

[00211] Figure 17 is a flow diagram that shows an example of determining an expanded search query for a sub-word unit search. According to some implementations, the operations of method 1700 may be performed in parallel with those of method 1600. In this example, FST representations of the phoneme-to-word and syllable-to-word confusion matrices P2W and S2W are received from the phonetic confusion model 903. Moreover, FST representations of the phoneme-to-phoneme and syllable-to-syllable confusion matrices P2P and S2S are also received from the phonetic confusion model 903. In this example, the lexicon transducers W2P and W2S shown in Figure 17 are determined by inverting P2W and S2W, respectively.

[00212] In this example, candidate queries QP_F and QS_F are generated according to the following equations:

QP_F = project(nshortest(prune(Q ∘ W2P ∘ P2P)))   (Equation 2)
QS_F = project(nshortest(prune(Q ∘ W2S ∘ S2S)))   (Equation 3)

[00213] In Equation 2, for example, the first operation is a composition of FST representations of the initial search query, of W2P and of P2P, followed by a pruning operation. According to this example, a shortest-distance operation is then employed (represented as "nshortest" in Equations 2 and 3) to filter out the top-N most likely search terms and their corresponding scores. In this example, the output labels of the FSTs are then projected to the input labels (as represented by "project" in Equations 2 and 3). The amount of query expansion may be controlled by varying the number of hypotheses kept after the composition process, for example by modifying the pruning process or by modifying the nshortest process.
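
As a non-WFST simplification of Equations 2 and 3 (not the disclosed implementation), the sketch below maps a query word to its phoneme string, generates alternative strings by applying low-cost phoneme confusions, and keeps only the N cheapest hypotheses; the pronunciation entry and confusion costs are hypothetical:

```python
# Minimal sketch of sub-word query expansion in the spirit of Equations 2 and 3.
import heapq
import itertools

def expand_query(phonemes, confusions, top_n=4):
    """confusions: {phoneme: [(alternative, cost), ...]}; returns the top_n fuzzy strings."""
    per_position = []
    for phn in phonemes:
        per_position.append([(phn, 0.0)] + confusions.get(phn, []))
    hypotheses = []
    for combo in itertools.product(*per_position):       # enumerate fuzzy alternatives
        string = tuple(p for p, _ in combo)
        cost = sum(c for _, c in combo)
        hypotheses.append((cost, string))
    return heapq.nsmallest(top_n, hypotheses)             # analogue of prune + nshortest

confusions = {"ah": [("ae", 0.3), ("aa", 0.5)], "r": [("w", 0.6)]}
print(expand_query(["t", "r", "ah", "k"], confusions))
```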

[00214] After expanded search queries have been determined for the sub-word unit search and the word unit search, a core word search may be performed. According to some examples, the core word search may be performed by the core search module 909. According to some such examples, these expanded search queries may be provided to the word unit search module 1506 and the sub-word unit search module 1507 of the core search module 909 (see Figure 15).

[00215] In some implementations, the word search and the sub-word search may be performed according to the following equations:

Pos_QFP = nshortest(det(proj(Q_FP ∘ Index_W)))   (Equation 4)
Pos_QFS = nshortest(det(proj(Q_FS ∘ Index_W)))   (Equation 5)
Pos_QFN = nshortest(det(proj(Q_FN ∘ Index_W)))   (Equation 6)
Pos_QFA = nshortest(det(proj(Q_FA ∘ Index_W)))   (Equation 7)
Pos_QSF = nshortest(det(proj(QS_F ∘ Index_S)))   (Equation 8)
Pos_QPF = nshortest(det(proj(QP_F ∘ Index_P)))   (Equation 9)

[00216] In Equations 4-7, the process begins with a composition of an FST representation of candidate search word terms with a word unit based index (represented by Index_W). The word unit based index may, for example, correspond with the word unit index 1501 that is shown providing input to the word unit search module 1506 in Figure 15. The result of this composition may be a target weight, which may include information regarding each corresponding utterance lattice ID and the start and end time of each utterance. According to these examples, a projection process follows each of the composition operations. In these examples, in order to control the output number of search results, determinization, minimization and top-N shortest distance operations are employed after the projections. Some implementations may involve methods such as those described in the publication by M. Mohri, C. Allauzen and M. Saraclar entitled "General indexation of weighted automata application to spoken utterance retrieval," in Proc. HLT/NAACL, 2004, vol. I, pp. 33-40, which is hereby incorporated by reference.

[00217] In Equations 8 and 9, the process begins with a composition of an FST representation of candidate search sub-word terms with a sub-word unit based index (represented by Index_S and Index_P in Equations 8 and 9). The sub-word unit based indices may, for example, correspond with the syllable unit index 1502 and the phoneme unit index 1503 that are shown providing input to the sub-word unit search module 1507 in Figure 15. Projection, determinization and "nshortest" operations follow the composition operations in these examples.

[00218] According to some implementations, score fusion and merging processes may follow operations like those of Equations 4-9. According to some such implementations, the output of Equations 4-9 may be provided to the post core search processing module 910.

[00219] In some implementations, score fusion and merging processes may emphasize the reduction of missed search results of interest (which may be represented as a reduction of P(miss)), which will generally lead to an increase of false positive search results (which may be represented as an increase of P(false-alarm)). Given one word query, there may be multiple search results having outputs corresponding to one time interval. For example, in a search for instances of the word "truck", one search result may return an instance of "truck" in a time interval from 10.1 seconds to 11.2 seconds, whereas another search result may return an instance of "tuck" in a time interval from 10.3 seconds to 10.8 seconds and another search result may return an instance of "trucked" in a time interval from 10.5 seconds to 11.5 seconds. When results from different sources for a specific target query are merged, as long as one of the sources is assigned a correct position the result will reduce P(miss). Some implementations disclosed herein involve aligning search results provided by each individual search source within a time window. In some such examples, within each time window only the search result having the best score will be used in the subsequent fusion process. For example, if instances of "truck," "tuck" and "trucked" were returned as described above, some implementations may only use the result having the highest score (e.g., "truck") in the subsequent fusion process.
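
A minimal Python sketch of this alignment step follows; it groups detections from different sources that overlap the same time window and keeps only the best-scoring detection in each window. The detection tuples, window length and scores are hypothetical, and a real implementation may define windows differently:

```python
# Minimal sketch of windowed alignment and selection (paragraph [00219]).
def merge_detections(detections, window=1.0):
    """detections: [(start, end, score, term, source), ...]; higher score = better."""
    detections = sorted(detections, key=lambda d: d[0])
    merged, current_window, window_end = [], [], None
    for det in detections:
        if window_end is None or det[0] < window_end:
            current_window.append(det)
            window_end = det[0] + window if window_end is None else window_end
        else:
            merged.append(max(current_window, key=lambda d: d[2]))  # best score in window
            current_window, window_end = [det], det[0] + window
    if current_window:
        merged.append(max(current_window, key=lambda d: d[2]))
    return merged

results = [(10.1, 11.2, 0.9, "truck", "word"),
           (10.3, 10.8, 0.4, "tuck", "phoneme"),
           (10.5, 11.5, 0.6, "trucked", "syllable"),
           (42.0, 42.6, 0.7, "truck", "word")]
print(merge_detections(results))
```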

[00220] Figure 18 shows an example of merging search results from different sources. The smaller rectangles shown in Figure 18, shown within the larger window scope rectangles, represent search results from a core search process. These search results are also referred to herein as detection candidates for fusion. The knowledge sources 1-3 correspond to sources of the search results. In the foregoing examples, such knowledge sources include word unit and sub- word unit searches. The horizontal lines shown in Figure 18 represent time. Window scope 1 and window scope 2 represent two time intervals within which detection candidates from multiple sources are being evaluated.

[00221] In this example, each detection candidate includes a start time, an end time and a score. Detection candidates may be evaluated in different ways, depending on the particular implementation. In some fusion examples, only the highest score is selected. Such examples may have a relatively higher risk of missing search results of interest (increased P(miss)), but will generally lead to a decrease of false positive search results (decreased P(false-alarm)).

[00222] Other examples of fusion may involve combining scores in some manner. For example, one implementation involves making a linear combination of each score, using a priori information regarding the importance of a particular knowledge source. A user may, for example, choose to emphasize expansion of search terms according to either semantic similarity or phonetic similarity. If the user is very certain regarding the pronunciation of a query term, for example, the user may want to obtain as many semantically similar results as possible and may indicate a higher weight for semantic similarity.
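
The following small sketch illustrates one possible linear combination of this kind; the knowledge-source names and weights are hypothetical and would, in practice, reflect a priori information or user preferences such as those described above:

```python
# Minimal sketch of linear score fusion across knowledge sources (paragraph [00222]).
SOURCE_WEIGHTS = {"word": 0.5, "syllable": 0.3, "phoneme": 0.2}  # hypothetical a priori weights

def fuse_scores(scores_by_source, weights=SOURCE_WEIGHTS):
    """scores_by_source: {source: score} for one candidate within one time window."""
    return sum(weights.get(source, 0.0) * score
               for source, score in scores_by_source.items())

print(fuse_scores({"word": 0.9, "syllable": 0.6}))   # 0.5*0.9 + 0.3*0.6 = 0.63
```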

[00223] The results of a fusion process may be sorted in some manner prior to output. In some implementations, the results may be sorted in descending order, according to the post-fusion scores.

[00224] Seeking the best combination of search parameters can be a challenging task. However, by taking advantage of user input, search parameter combinations can be explicitly acquired in some implementations.

[00225] A user's input may vary according to the particular implementation. In some examples, the user's input may include a vocabulary size, e.g., for generating W2P or W2S. The user's input may indicate the relative importance of semantic or phonetic similarity. In some examples, the user's input may correspond with a maximum number of search results. Alternatively, or additionally, the user's input may indicate the relative importance of reducing the miss rate or of avoiding false positive search results. The user's selections may be stored.

[00226] A search may be performed according to various factors and search results 1920 may be provided to the user. The user may provide further input according to the search results in order to refine the search process. Such further input may, directly or indirectly, indicate whether the search results were generally acceptable or were not acceptable. In some examples, the system may choose, or suggest to the user, search parameters that have previously resulted in a successful search.

[00227] Figure 19 shows an example of a graphical user interface (GUI) for receiving user input regarding search parameters. The display device 610 that is shown in Figure 19 may be another instance of the display device 610 shown in Figure 6 and described above. In this example, the GUI 606 includes slider bars 1905a- 1905d, which are provided on the display 1910. In this example, the display device 610 includes a touch sensor system, one component of which is the touch panel 1915 that overlies the display 1910.

[00228] According to this implementation, the slider bar 1905a allows input regarding vocabulary size for generating W2P and W2S. Here, the slider bar 1905b allows input regarding the importance of phonetic similarity and the slider bar 1905c allows input regarding semantic similarity. In this example, a user's finger 1912 is shown interacting with a slider 1920 of the slider bar 1905d to select a relative importance of reducing the miss rate. Other implementations may provide different user interfaces and may allow input regarding other search parameters.

[00229] In some examples, information relating to search results may be displayed in the area 1925. According to some such examples, the information displayed in the area 1925 may include a GUI for playback of audio data that corresponds with search results. The GUI may, in some examples, be similar to other GUIs provided herein, such as the GUI shown in Figure 27. However, instead of indicating conference topics, as shown in Figure 27, in some such implementations the GUI may indicate the text of search results. Some such examples may involve the simultaneous playback of audio data corresponding to two or more portions of a conference and/or may involve spatial audio playback, e.g., as described elsewhere herein. In some examples, the GUI may indicate waveforms, conference participants, etc.

[00230] After reviewing at least some of the search results, a user may interact with one or more of the slider bars 1905a-1905d in order to modify one or more search parameters. For example, if the user wanted search results pertaining to the word "lorry" but had only obtained search results pertaining to the word "truck," the user might choose to increase the importance of semantic similarity by sliding a slider 1920 of the slider bar 1905c towards the High end. The user might continue to review search results and interact with the GUI 606 until the user is satisfied with the search results.

[00231] According to some implementations, many hypotheses for a given utterance (e.g., as described in a speech recognition lattice) may contribute to a word cloud. In some examples, a whole-conference (or a multi-conference) context may be introduced by compiling lists of alternative hypotheses for many words found in an entire conference and/or found in multiple conferences. Some implementations may involve applying a whole-conference (or a multi-conference) context over multiple iterations to re-score the hypothesized words of speech recognition lattices (e.g., by de-emphasizing less-frequent alternatives), thereby removing some utterance-level ambiguity.

[00232] In some examples, a "term frequency metric" may be used to sort primary word candidates and alternative word hypotheses. In some such examples, the term frequency metric may be based, at least in part, on a number of occurrences of a hypothesized word in the speech recognition lattices and the word recognition confidence score reported by the speech recognizer. In some examples, the term frequency metric may be based, at least in part, on the frequency of a word in the underlying language and/or the number of different meanings that a word may have. In some implementations, words may be generalized into topics using an ontology that may include hypernym information.

[00233] Figure 20 is a flow diagram that outlines blocks of some topic analysis methods disclosed herein. The blocks of method 2000, like other methods described herein, are not necessarily performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described.

[00234] In some implementations, method 2000 may be implemented, at least in part, via instructions (e.g., software) stored on non-transitory media such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. In some implementations, method 2000 may be implemented, at least in part, by an apparatus such as that shown in Figure 3A. According to some such implementations, method 2000 may be implemented, at least in part, by one or more elements of the analysis engine 307 shown in Figures 3C and 5, e.g., by the joint analysis module 306. According to some such examples, method 2000 may be implemented, at least in part, by the topic analysis module 525 of Figure 5.

[00235] In this example, block 2005 involves receiving speech recognition results data for at least a portion of a conference recording of a conference involving a plurality of conference participants. In some examples, speech recognition results data may be received by a topic analysis module in block 2005. Here, the speech recognition results data include a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices. In this implementation, the word recognition confidence score corresponds with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference. In some implementations, speech recognition results data from two or more automatic speech recognition processes may be received in block 2005. Some examples are described below.

[00236] In some implementations, the conference recording may include conference participant speech data from multiple endpoints, recorded separately. Alternatively, or additionally, the conference recording may include conference participant speech data from a single endpoint corresponding to multiple conference participants and including information for identifying conference participant speech for each conference participant of the multiple conference participants.

[00237] In the example shown in Figure 20, block 2010 involves determining a primary word candidate and one or more alternative word hypotheses for each of a plurality of hypothesized words in the speech recognition lattices. Here, the primary word candidate has a word recognition confidence score indicating a higher likelihood of correctly corresponding with the actual word spoken by a conference participant during the conference than a word recognition confidence score of any of the alternative word hypotheses.

[00238] In this implementation, block 2015 involves calculating a "term frequency metric" for the primary word candidates and the alternative word hypotheses. In this example, the term frequency metric is based, at least in part, on a number of occurrences of a hypothesized word in the speech recognition lattices and on the word recognition confidence score.

[00239] According to some examples, the term frequency metric may be based, at least in part, on a "document frequency metric." In some such examples, the term frequency metric may be inversely proportional to the document frequency metric. The document frequency metric may, for example, correspond to an expected frequency with which a primary word candidate will occur in the conference.

[00240] In some implementations, the document frequency metric may correspond to a frequency with which the primary word candidate has occurred in two or more prior conferences. The prior conferences may, for example, be conferences in the same category, e.g., business conferences, medical conferences, engineering conferences, legal conferences, etc. In some implementations, conferences may be categorized by subcategory, e.g., the category of engineering conferences may include sub-categories of electrical engineering conferences, mechanical engineering conferences, audio engineering conferences, materials science conferences, chemical engineering conferences, etc. Likewise, the category of business conferences may include sub-categories of sales conferences, finance conferences, marketing conferences, etc. In some examples, the conferences may be categorized, at least in part, according to the conference participants.

[00241] Alternatively, or additionally, the document frequency metric may correspond to a frequency with which the primary word candidate occurs in at least one language model, which may estimate the relative likelihood of different words and/or phrases, e.g., by assigning a probability to a sequence of words according to a probability distribution. The language model(s) may provide context to distinguish between words and phrases that sound similar. A language model may, for example, be a statistical language model such as a unigram model, an N-gram model, a factored language model, etc. In some implementations, a language model may correspond with a conference type, e.g., with the expected subject matter of a conference. For example, a language model pertaining to medical terms may assign higher probabilities to the words "spleen" and "infarction" than a language model pertaining to non-medical speech.

[00242] According to some implementations, conference category, conference sub-category, and/or language model information may be received with the speech recognition results data in block 2005. In some such implementations, such information may be included with the conference metadata 210 received by the topic analysis module 525 of Figure 5.

[00243] Various alternative examples of determining term frequency metrics are disclosed herein. In some implementations, the term frequency metric may be based, at least in part, on a number of word meanings. In some such implementations, the term frequency metric may be based, at least in part, on the number of definitions of the corresponding word in a standard reference, such as a particular lexicon or dictionary.

[00244] In the example shown in Figure 20, block 2020 involves sorting the primary word candidates and alternative word hypotheses according to the term frequency metric. In some implementations, block 2020 may involve sorting the primary word candidates and alternative word hypotheses in descending order of the term frequency metric.

[00245] In this implementation, block 2025 involves including the alternative word hypotheses in an alternative hypothesis list. In some implementations, iterations of at least some processes of method 2000 may be based, at least in part, on the alternative hypothesis list. Accordingly, some implementations may involve retaining the alternative hypothesis list during one or more such iterations, e.g., after each iteration.

[00246] In this example, block 2030 involves re-scoring at least some hypothesized words of the speech recognition lattices according to the alternative hypothesis list. In other words, a word recognition confidence score that is received for one or more hypothesized words of the speech recognition lattices in block 2005 may be changed during one or more such iterations of the determining, calculating, sorting, including and/or re-scoring processes. Further details and examples are provided below.

[00247] In some examples, method 2000 may involve forming a word list that includes primary word candidates and a term frequency metric for each of the primary word candidates. In some examples, the word list also may include one or more alternative word hypotheses for each primary word candidate. The alternative word hypotheses may, for example, be generated according to a language model.

[00248] Some implementations may involve generating a topic list of conference topics based, at least in part, on the word list. The topic list may include one or more words of the word list. Some such implementations may involve determining a topic score. For example, such implementations may determine whether to include a word on the topic list based, at least in part, on the topic score. According to some implementations, the topic score may be based, at least in part, on the term frequency metric.

[00249] In some examples, the topic score may be based, at least in part, on an ontology for topic generalization. In linguistics, a hyponym is a word or phrase whose semantic field is included within that of another word, known as its hypernym. A hyponym shares a "type-of" relationship with its hypernym. For example, "robin," "starling," "sparrow," "crow" and "pigeon" are all hyponyms of "bird" (their hypernym), which, in turn, is a hyponym of "animal."

[00250] Accordingly, in some implementations generating the topic list may involve determining at least one hypernym of one or more words of the word list. Such implementations may involve determining a topic score based, at least in part, on a hypernym score. In some implementations, the hypernyms need not have been spoken by a conference participant in order to be part of the topic score determination process. Some examples are provided below.

[00251] According to some implementations, multiple iterations of at least some processes of method 2000 may include iterations of generating the topic list and determining the topic score. In some such implementations, block 2025 may involve including alternative word hypotheses in the alternative hypothesis list based, at least in part, on the topic score. Some implementations are described below, following some examples of using hypernyms as part of a process of determining a topic score.

[00252] In some examples, method 2000 may involve reducing at least some hypothesized words of a speech recognition lattice to a canonical base form. In some such examples, the reducing process may involve reducing nouns of the speech recognition lattice to the canonical base form. The canonical base form may be a singular form of a noun. Alternatively, or additionally, the reducing process may involve reducing verbs of the speech recognition lattice to the canonical base form. The canonical base form may be an infinitive form of a verb.

[00253] Figure 21 shows examples of topic analysis module elements. As with other implementations disclosed herein, other implementations of the topic analysis module 525 may include more, fewer and/or other elements. The topic analysis module 525 may, for example, be implemented via a control system, such as that shown in Figure 3A. The control system may include at least one of a general purpose single- or multi-chip processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, or discrete hardware components. In some implementations, the topic analysis module 525 may be implemented via instructions (e.g., software) stored on non-transitory media such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc.

[00254] In this example, the topic analysis module 525 is shown receiving speech recognition lattices 2101. The speech recognition lattices 2101 may, for example, be instances of speech recognition results such as the speech recognition results 401F-405F that are described above with reference to Figures 4 and 5. Some examples of speech recognition lattices are described below.

[00255] This example of the topic analysis module 525 includes a lattice rescoring unit 2102. In some implementations, the lattice rescoring unit 2102 may be capable of re-scoring at least some hypothesized words of the speech recognition lattices 2101 according to the alternative hypothesis list. For example, the lattice rescoring unit 2102 may be capable of changing the word recognition confidence score of hypothesized words that are found in the alternative hypothesis list 2107 such that these hypothesized words are de-emphasized. This process may depend on the particular metric used for the word recognition confidence score. For example, in some implementations a word recognition confidence score may be expressed in terms of a cost, the values of which may be a measure of how unlikely a hypothesized word is to be correct. According to such implementations, de-emphasizing such hypothesized words may involve increasing a corresponding word recognition confidence score.

[00256] According to some implementations, the alternative hypothesis list 2107 may initially be empty. If so, the lattice rescoring unit 2102 may perform no re-scoring until a later iteration.

[00257] In this example, the topic analysis module 525 includes a lattice pruning unit 2103. The lattice pruning unit 2103 may, for example, be capable of performing one or more types of lattice pruning operations (such as beam pruning, posterior probability pruning and/or lattice depth limiting) in order to reduce the complexity of the input speech recognition lattices 2101.

[00258] Figure 22 shows an example of an input speech recognition lattice. As shown in Figure 22, un-pruned speech recognition lattices can be quite large. The circles in Figure 22 represent nodes of the speech recognition lattice. The curved lines or "arcs" connecting the nodes correspond with hypothesized words, which may be connected via the arcs to form hypothesized word sequences.

[00259] Figure 23, which includes Figures 23 A and 23B, shows an example of a portion of a small speech recognition lattice after pruning. In this example, the pruned speech recognition lattice corresponds to a first portion of the utterance "I accidentally did not finish my beef jerky coming from San Francisco to Australia." In this example, alternative word hypotheses for the same hypothesized word are indicated on arcs between numbered nodes. Different arcs of the speech recognition lattice may be traversed to form alternative hypothesized word sequences. For example, the hypothesized word sequence "didn't finish" is represented by arcs connecting nodes 2, 6 and 8. The hypothesized word sequence "did of finish" is represented by arcs connecting nodes 5, 11, 12 and 15. The hypothesized word sequence "did of finished" is represented by arcs connecting nodes 5, 11, 12 and 14. The hypothesized word sequence "did not finish" is represented by arcs connecting nodes 5, 11 and 17-20. The hypothesized word sequence "did not finished" is represented by arcs connecting nodes 5, 11, 17 and 18. All of the foregoing hypothesized word sequences correspond to the actual sub-utterance "did not finish."

[00260] In some speech recognition systems, the speech recognizer may report a word recognition confidence score in terms of a logarithmic acoustic cost C_A, which is a measure of how unlikely this hypothesized word on this path through the lattice is to be correct, given the acoustic input features to the speech recognizer. The speech recognizer also may report a word recognition confidence score in terms of a logarithmic language cost C_L, which is a measure of how unlikely this hypothesized word on this path through the lattice is to be correct given the language model. The acoustic and language costs may be reported for each arc in the lattice.

[00261] For each arc in the lattice portion shown in Figure 23, for example, the combined acoustic and language cost (C_A + C_L) for that arc is shown next to each hypothesized word. In this example, the best hypothesized word sequence through the speech recognition lattice corresponds with the path from the start node to an end node that has the lowest sum of arc costs.
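
The following Python sketch illustrates this selection in a minimal way, treating the lattice as a directed acyclic graph whose arcs carry combined costs and finding the cheapest path from the start node to the end node. The node numbers, words and costs are hypothetical and are not those of Figure 23:

```python
# Minimal sketch: best hypothesized word sequence as the lowest-cost lattice path.
import heapq

def best_path(arcs, start, end):
    """arcs: [(from_node, to_node, word, cost)]; Dijkstra-style search for the cheapest path."""
    graph = {}
    for src, dst, word, cost in arcs:
        graph.setdefault(src, []).append((dst, word, cost))
    queue = [(0.0, start, [])]
    visited = set()
    while queue:
        total, node, words = heapq.heappop(queue)
        if node == end:
            return total, words
        if node in visited:
            continue
        visited.add(node)
        for dst, word, cost in graph.get(node, []):
            heapq.heappush(queue, (total + cost, dst, words + [word]))
    return float("inf"), []

lattice = [(0, 1, "did", 1.2), (0, 2, "did", 1.5), (1, 3, "not", 0.8),
           (2, 3, "of", 1.9), (3, 4, "finish", 0.4), (3, 4, "finished", 1.1)]
print(best_path(lattice, 0, 4))   # -> (2.4, ['did', 'not', 'finish'])
```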

[00262] In the example shown in Figure 21, the topic analysis module 525 includes a morphology unit 2104. The morphology unit 2104 may be capable of reducing hypothesized words to a canonical base form. For example, in some implementations that involve reducing nouns of the speech recognition lattice to the canonical base form, the morphology unit 2104 may be capable of reducing plural forms of a noun to singular forms (for example, reducing "cars" to "car"). In some implementations that involve reducing verbs of the speech recognition lattice to the canonical base form, the morphology unit 2104 may be capable of reducing a verb to an infinitive form (for example, reducing "running," "ran," or "runs" to "run").

[00263] Alternative implementations of the morphology unit 2104 may include a so-called "stemmer," such as a Porter Stemmer. However, a basic stemmer of this type may not be capable of accurately transforming irregular noun or verb forms (such as reducing "mice" to "mouse"). A more accurate morphology implementation may be needed for such transformations, such as the WordNet morphology described in Miller, George A., WordNet: A Lexical Database for English, in Communications of the ACM Vol. 38, No. 11, pages 39-41 (1995).

[00264] The topic analysis module 525 of Figure 21 includes a term frequency metric calculator 2105. In some implementations, the term frequency metric calculator 2105 may be capable of determining a term frequency metric for hypothesized words of the speech recognition lattices 2101. In some such implementations, the term frequency metric calculator 2105 may be capable of determining a term frequency metric for each noun observed in the input lattices (for example, the morphology unit 2104 may be capable of determining which hypothesized words are nouns).

[00265] In some implementations, the term frequency metric calculator 2105 may be capable of determining a term frequency metric according to a Term Frequency/Inverse Document Frequency (TF-IDF) function. In one such example, each time a hypothesized word with index x of a lexicon is detected in the input speech recognition lattices, the term frequency metric TF_x may be determined as follows:

TF_x = TF_x' + C / (N · max(ln DF_x, MDF))   (Equation 10)

[00266] In Equation 10, TF_x' represents the previous term frequency metric for the word x. If this is the first time that the word x has been encountered during the current iteration, the value of TF_x' may be set to zero. In Equation 10, DF_x represents a document frequency metric and ln indicates the natural logarithm. As noted above, the document frequency metric may correspond to an expected frequency with which a word will occur in the conference. In some examples, the expected frequency may correspond to a frequency with which the word has occurred in two or more prior conferences. In the case of a general business teleconference system, the document frequency metric may be derived by counting the frequency with which this word appears across a large number of business teleconferences.

[00267] Alternatively, or additionally, the expected frequency may correspond to a frequency with which the primary word candidate occurs in a language model. Various implementations of methods disclosed herein may be used in conjunction with a speech recognizer, which may apply some type of word frequency metric as part of its language model. Accordingly, in some implementations a language model used for speech recognition may provide the document frequency metric used by the term frequency metric calculator 2105. In some implementations, such information may be provided along with the speech recognition lattices or included with the conference metadata 210.

[00268] In Equation 10, MDF represents a selected constant that indicates a minimum logarithmic document frequency. In some implementations, MDF values may be integers in the range of -10 to -4, e.g., -6.

[00269] In Equation 10, C represents a word recognition confidence score in the range [0 - 1] as reported by the speech recognizer in the input lattice. According to some implementations, C may be determined according to:

C = exp(−C_A − C_L)   (Equation 11)

[00270] In Equation 11, C_A represents the logarithmic acoustic cost and C_L represents the logarithmic language cost, both of which are represented using the natural logarithm.

[00271] In Equation 10, N represents a number of word meanings. In some implementations, the value of N may be based on the number of definitions of the word in a standard lexicon, such as that of a particular dictionary.

[00272] According to some alternative implementations, the term frequency metric TF_x may be determined as follows:

TF_x = TF_x' + (α · C + (1 − α)) / (N · max(ln DF_x, MDF))   (Equation 12)

[00273] In Equation 12, α represents a weight factor that may, for example, have a value in the range of zero to one. In Equation 10, the recognition confidence C is used in an un-weighted manner. In some instances, an un-weighted recognition confidence C could be non-optimal, e.g., if a hypothesized word has a very high recognition confidence but appears less frequently. Therefore, adding the weight factor α may help to control the importance of recognition confidence. It may be seen that when α = 1, Equation 12 is equivalent to Equation 10. However, when α = 0, recognition confidence is not used and the term frequency metric may be determined according to the inverse of the terms in the denominator.
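
The following Python sketch is a direct transcription of Equations 10-12 as reconstructed above; it is illustrative only, and the conventions assumed for DF_x (the document frequency metric), MDF and the example values are not taken from a reference implementation:

```python
# Minimal sketch of the term frequency metric update (Equations 10-12).
import math

def word_confidence(acoustic_cost, language_cost):
    """Equation 11: C = exp(-C_A - C_L)."""
    return math.exp(-acoustic_cost - language_cost)

def update_term_frequency(tf_prev, confidence, df, n_meanings, mdf=-6.0, alpha=1.0):
    """Equation 12; with alpha = 1.0 it reduces to Equation 10."""
    numerator = alpha * confidence + (1.0 - alpha)
    denominator = n_meanings * max(math.log(df), mdf)   # N * max(ln DF_x, MDF)
    return tf_prev + numerator / denominator

# Hypothetical update for one lattice occurrence of a word with 3 dictionary senses.
c = word_confidence(acoustic_cost=0.2, language_cost=0.4)
print(update_term_frequency(tf_prev=0.0, confidence=c, df=0.001, n_meanings=3))
```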

[00274] In the example shown in Figure 21, the topic analysis module 525 includes an alternative word hypothesis pruning unit 2106. As the word list 2108 is created, the system notes a set of alternative word hypotheses for each word by analyzing alternative paths through the lattice for the same time interval.

[00275] For example, if the actual word spoken by a conference participant was the word pet, the speech recognizer may have reported put and pat as alternative word hypotheses. For a second instance of the actual word pet, the speech recognizer may have reported pat, pebble and parent as alternative word hypotheses. In this example, after analyzing all the speech recognition lattices corresponding to all the utterances in the conference, the complete list of alternative word hypotheses for the word pet may include put, pat, pebble and parent. The word list 2108 may be sorted in descending order of TF_x.

[00276] In some implementations of the alternative word hypothesis pruning unit 2106, alternative word hypotheses appearing further down the list (for example, having a lower value of TF_x) may be removed from the list. Removed alternatives may be added to the alternative word hypothesis list 2107. For example, if the hypothesized word pet has a higher TF_x than its alternative word hypotheses, the alternative word hypothesis pruning unit 2106 may remove the alternative word hypotheses pat, put, pebble and parent from the word list 2108 and add the alternative word hypotheses pat, put, pebble and parent to the alternative word hypothesis list 2107.
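
A minimal Python sketch of this pruning step follows; the word list contents and term frequency values are hypothetical:

```python
# Minimal sketch of alternative word hypothesis pruning (paragraph [00276]).
def prune_alternatives(word_list, alternatives):
    """word_list: {word: tf}; alternatives: {word: [alternative words]}."""
    alternative_hypotheses = set()
    for word, tf in list(word_list.items()):
        for alt in alternatives.get(word, []):
            if word_list.get(alt, float("-inf")) < tf:
                alternative_hypotheses.add(alt)   # moved to the alternative hypothesis list
                word_list.pop(alt, None)          # removed from the word list
    return word_list, alternative_hypotheses

words = {"pet": 5.0, "pat": 1.2, "put": 0.8}
alts = {"pet": ["pat", "put", "pebble", "parent"]}
print(prune_alternatives(words, alts))
```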

[00277] In this example, the topic analysis module 525 stores an alternative word hypothesis list 2107 in memory, at least temporarily. The alternative word hypothesis list 2107 may be input to the lattice rescoring unit 2102, as described elsewhere, over a number of iterations. The number of iterations may vary according to the particular implementation and may be, for example, in the range 1 to 20. In one particular implementation, 4 iterations produced satisfactory results.

[00278] In some implementations, the word list 2108 may be deleted at the start of each iteration and may be re-compiled during the next iteration. According to some implementations, the alternative word hypothesis list 2107 may not be deleted at the start of each iteration, so the alternative word hypothesis list 2107 may grow in size as the iterations continue.

[00279] In the example shown in Figure 21, the topic analysis module 525 includes a topic scoring unit 2109. The topic scoring unit 2109 may be capable of determining a topic score for words in the word list 2108.

[00280] In some examples, the topic score may be based, at least in part, on an ontology 2110 for topic generalization, such as the WordNet ontology discussed elsewhere herein. Accordingly, in some implementations generating the topic list may involve determining at least one hypernym of one or more words of the word list 2108. Such implementations may involve determining a topic score based, at least in part, on a hypernym score. In some implementations, the hypernyms need not have been spoken by a conference participant in order to be part of the topic score determination process.

[00281] For example, a pet is an example of an animal, which is a type of organism, which is a type of living thing. Therefore, the word "animal" may be considered a first-level hypernym of the word "pet." The word "organism" may be considered a second- level hypernym of the word "pet" and a first-level hypernym of the word "animal." The phrase "living thing" may be considered a third-level hypernym of the word "pet," a second- level hypernym of the word "animal" and a first-level hypernym of the word "organism."

[00282] Therefore, if the word "pet" is on the word list 2108, in some implementations the topic scoring unit 2109 may be capable of determining a topic score according to one of more of the hypernyms "animal," "organism" and/or "living thing." According to one such example, for each word on the word list 2108, the topic scoring unit 2109 may traverse up the hypernym tree N levels (here, for example, N=2), adding each hypernym to the topic list 2111 if not already present and adding the term frequency metric of the word to the topic score associated with the hypernym. For example, if pet is present on the word list 2108 with a term frequency metric of 5, then pet, animal and organism will be added to the topic list with a term frequency metric of 5. If animal is also on the word list 2108 with term frequency metric of 3, then the topic score of animal and organism will have 3 added for a total topic score of 8, and living thing will be added to the word list 2108 with a term frequency metric of 3.

[00283] According to some implementations, multiple iterations of at least some processes of method 2000 may include iterations of generating the topic list and determining the topic score. In some such implementations, block 2025 of method 2000 may involve including alternative word hypotheses in the alternative hypothesis list based, at least in part, on the topic score. For example, in some alternative implementations, the topic analysis module 525 may be capable of topic scoring based on the output of the term frequency metric calculator 2105. According to some such implementations, the alternative word hypothesis pruning unit 2106 may perform alternative hypothesis pruning of topics, in addition to alternative word hypotheses.

[00284] For example, suppose that the topic analysis module 525 had determined a conference topic of "pets" due to a term frequency metric of 15 for one or more instances of "pet," a term frequency metric of 5 for an instance of "dog" and a term frequency metric of 4 for an instance of "goldfish." Suppose further that there may be a single utterance of "cat" somewhere in the conference, but there is significant ambiguity as to whether the actual word spoken was "cat," "mat," "hat," "catamaran," "catenary," "caterpillar," etc. If the topic analysis module 525 had only been considering word frequencies in the feedback loop, then the word list 2108 would not facilitate a process of disambiguating these hypotheses, because there was only one potential utterance of "cat." However, because "cat" is a hyponym of "pet," which was identified as a topic by virtue of other words spoken, then the topic analysis module 525 may potentially be better able to disambiguate that potential utterance of "cat."

[00285] In this example, the topic analysis module 525 includes a metadata processing unit 2115. According to some implementations, the metadata processing unit 2115 may be capable of producing a bias word list 2112 that is based, at least in part, on the conference metadata 210 received by the topic analysis module 525. The bias word list 2112 may, for example, include a list of words that may be inserted directly into the word list 2108 with a fixed term frequency metric. The metadata processing unit 2115 may, for example, derive the bias word list 2112 from a priori information pertaining to the topic or subject of the meeting, e.g., from a calendar invitation, from email, etc. A bias word list 2112 may bias a topic list building process to be more likely to contain topics pertaining to a known subject of the meeting.

[00286] In some implementations, the alternative word hypotheses may be generated according to multiple language models. For example, if the conference metadata were to indicate that a conference may involve legal and medical issues, such as medical malpractice issues corresponding to a lawsuit based on a patient's injury or death due to a medical procedure, the alternative word hypotheses may be generated according to both medical and legal language models.

[00287] According to some such implementations, multiple language models may be interpolated internally by an ASR process, so that the speech recognition results data received in block 2005 of method 2000 and/or the speech recognition lattices 2101 received in Figure 21 are based on multiple language models. In alternative implementations, the ASR process may output multiple sets of speech recognition lattices, each set corresponding to a different language model. A topic list 2111 may be generated for each type of input speech recognition lattice. Multiple topic lists 2111 may be merged into a single topic list 2111 according to the resulting topic scores.
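The merging step might be implemented as in the following sketch. Summing the per-list topic scores and sorting by the combined score is an assumption; other combination rules (e.g., maximum or weighted sum) would fit the same structure.

```python
from collections import defaultdict

def merge_topic_lists(topic_lists):
    """Merge per-language-model topic lists (topic -> topic score) into a
    single topic list, summing scores and sorting by the combined score."""
    merged = defaultdict(float)
    for topic_list in topic_lists:
        for topic, score in topic_list.items():
            merged[topic] += score
    return dict(sorted(merged.items(), key=lambda item: item[1],
                       reverse=True))

# e.g. a medical-model topic list merged with a legal-model topic list:
print(merge_topic_lists([{"malpractice": 4.0, "surgery": 6.0},
                         {"malpractice": 7.0, "liability": 3.0}]))
# {'malpractice': 11.0, 'surgery': 6.0, 'liability': 3.0}
```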

[00288] According to some implementations disclosed herein, the topic list 2111 may be used to facilitate a process of playing back a conference recording, searching for topics in a conference recording, etc. According to some such implementations, the topic list 2111 may be used to provide a "word cloud" of topics corresponding to some or all of the conference recording.

[00289] Figure 24, which includes Figures 24A and 24B, shows an example of a user interface that includes a word cloud for an entire conference recording. The user interface 606a may be provided on a display and may be used for browsing the conference recording. For example, the user interface 606a may be provided on a display of a display device 610, as described above with reference to Figure 6.

[00290] In this example, the user interface 606a includes a list 2401 of conference participants of the conference recording. Here, the user interface 606a shows waveforms 625 in time intervals corresponding to conference participant speech.

[00291] In this implementation, the user interface 606a provides a word cloud 2402 for an entire conference recording. Topics from the topic list 2111 may be arranged in the word cloud 2402 in descending order of topic frequency (e.g., from right to left) until no further room is available, e.g., given a minimum font size.
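A rough sketch of such a placement pass appears below. The area model (characters times the square of the minimum font size, scaled by a factor) is a simplifying assumption used only to show the "fill in descending order until no room remains" behavior.

```python
def select_cloud_topics(topic_scores, available_area, min_font_size=10,
                        char_area_factor=0.7):
    """Take topics in descending order of topic score and keep adding them
    to the word cloud until no further room is available, using the
    minimum font size as the most optimistic space estimate."""
    selected, used_area = [], 0.0
    for topic, score in sorted(topic_scores.items(),
                               key=lambda item: item[1], reverse=True):
        needed = len(topic) * (min_font_size ** 2) * char_area_factor
        if used_area + needed > available_area:
            break
        selected.append(topic)
        used_area += needed
    return selected

print(select_cloud_topics({"pet": 8.0, "kitten": 5.0, "large integer": 2.0},
                          available_area=1000))
```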

[00292] According to some such implementations, a topic placement algorithm for the word cloud 2402 may be re-run each time the user adjusts a zoom ratio. For example, a user may be able to interact with the user interface 606a (e.g., via touch, gesture, voice command, etc.) in order to "zoom in" or enlarge at least a portion of the graphical user interface 606, to show a smaller time interval than that of the entire conference recording. According to some such examples, the playback control module 605 of Figure 6 may access a different instance of the conversational dynamics data files 515a-515n, which may have been previously output by the conversational dynamics analysis module 510, that more closely corresponds with a user-selected time interval.

[00293] Figure 25, which includes Figures 25A and 25B, shows an example of a user interface that includes a word cloud for each of a plurality of conference segments. As in the previous example, the user interface 606b includes a list 2401 of conference participants and shows waveforms 625 in time intervals corresponding to conference participant speech.

[00294] However, in this implementation, the user interface 606b provides a word cloud for each of a plurality of conference segments 1808A-1808J. According to some such implementations, the conference segments 1808A-1808J may have previously been determined by a segmentation unit, such as the segmentation unit 1804 that is described above with reference to Figure 18B. In some implementations, the topic analysis module 525 may be invoked separately for each segment 1808 of the conference (for example, by using only the speech recognition lattices 2101 corresponding to utterances from one segment 1808 at a time) to generate a separate topic list 2111 for each segment 1808.
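The per-segment invocation might look like the following sketch. The segment and lattice field names, and the `build_topic_list` callable standing in for the full topic analysis, are hypothetical.

```python
def per_segment_topic_lists(segments, lattices, build_topic_list):
    """Invoke the topic analysis once per conference segment, using only
    the speech recognition lattices whose utterances fall inside that
    segment's time interval."""
    topic_lists = {}
    for segment in segments:
        segment_lattices = [
            lat for lat in lattices
            if segment["start"] <= lat["start_time"] < segment["end"]
        ]
        topic_lists[segment["id"]] = build_topic_list(segment_lattices)
    return topic_lists
```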

[00295] In some implementations, the size of the text used to render each topic in a word cloud may be made proportional to the topic frequency. In the implementation shown in Figure 25A, for example, the topics "kitten" and "newborn" are shown in a slightly larger font size than the topic "large integer," indicating that the topics "kitten" and "newborn" were discussed more than the topic "large integer" in the segment 1808C. However, in some implementations the text size of a topic may be constrained by the area available for displaying a word cloud, a minimum font size (which may be user-selectable), etc.
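A possible font-size rule, sketched under stated assumptions, scales text size linearly with topic frequency and clamps it to the minimum and maximum sizes; a logarithmic scale would fit the same interface.

```python
def topic_font_size(topic_score, max_topic_score, min_font_size=10,
                    max_font_size=32):
    """Scale the rendered text size in proportion to the topic frequency,
    clamped to a (possibly user-selectable) minimum font size and a
    layout-imposed maximum."""
    if max_topic_score <= 0:
        return min_font_size
    size = max_font_size * topic_score / max_topic_score
    return int(min(max_font_size, max(min_font_size, size)))

# e.g. "kitten" (score 5) rendered slightly larger than "large integer" (score 2):
print(topic_font_size(5.0, 8.0), topic_font_size(2.0, 8.0))  # 20 10
```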

[00296] Figure 26 is a flow diagram that outlines blocks of some playback control methods disclosed herein. The blocks of method 2600, like other methods described herein, are not necessarily performed in the order indicated. Moreover, such methods may include more or fewer blocks than shown and/or described.

[00297] In some implementations, method 2600 may be implemented, at least in part, via instructions (e.g., software) stored on non-transitory media such as those described herein, including but not limited to random access memory (RAM) devices, read-only memory (ROM) devices, etc. In some implementations, method 2600 may be implemented, at least in part, by an apparatus such as that shown in Figure 3A. According to some such implementations, method 2600 may be implemented, at least in part, by one or more elements of the playback system 609 shown in Figure 6, e.g., by the playback control module 605.

[00298] In this example, block 2605 involves receiving a conference recording of at least a portion of a conference involving a plurality of conference participants and a topic list of conference topics. In some implementations, as shown in Figure 6, block 2605 may involve receipt by the playback system 609 of individual playback streams, such as the playback streams 401B-403B. According to some such implementations, block 2605 may involve receiving other data, such as the playback stream indices 401A-403A, the analysis results 301C-303C, the segment and word cloud data 309, the search index 310 and/or the meeting overview information 311 received by the playback system 609 of Figure 6. Accordingly, in some examples block 2605 may involve receiving conference segment data including conference segment time interval data and conference segment classifications.
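For concreteness, the data received in block 2605 might be bundled as in the sketch below. The class and field names are illustrative assumptions, not the patent's own data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ConferenceSegment:
    """Hypothetical container for conference segment data: a time interval
    plus a segment classification (e.g., 'presentation', 'discussion')."""
    start_time: float
    end_time: float
    classification: str

@dataclass
class PlaybackInput:
    """Hypothetical bundle of data a playback controller might receive in
    block 2605."""
    playback_streams: List[str]          # e.g. file paths or stream URIs
    playback_stream_indices: List[Dict]  # per-stream index data
    topic_list: Dict[str, float]         # topic -> topic score
    segments: List[ConferenceSegment] = field(default_factory=list)
```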

[00299] According to some implementations, block 2605 may involve receiving the conference recording and/or other information via an interface system. The interface system may include a network interface, an interface between a control system and a memory system, an interface between the control system and another device and/or an external device interface.

[00300] Here, block 2610 involves providing instructions for controlling a display to make a presentation of displayed conference topics for at least a portion of the conference. In this example, the presentation includes images of words corresponding to at least some of the conference topics, such as the word cloud 2402 shown in Figure 24. In some implementations, the playback control module 605 may provide such instructions for controlling a display in block 2610. For example, block 2610 may involve providing such instructions to a display device, such as the display device 610, via the interface system.

[00301] The display device 610 may, for example, be a laptop computer, a tablet computer, a smart phone or another type of device that is capable of providing a graphical user interface that includes a word cloud of displayed conference topics, such as the graphical user interface 606a of Figure 24 or the graphical user interface 606b of Figure 25, on a display. For example, the display device 610 may be capable of executing a software application or "app" for providing the graphical user interface according to instructions from the playback control module 605, receiving user input, sending information to the playback control module 605 corresponding to received user input, etc.

[00302] In some instances, the user input received by the playback control module 605 may include an indication of a selected conference recording time interval chosen by a user, e.g., according to user input corresponding to a "zoom in" or a "zoom out" command. In response to such user input, the playback control module 605 may provide, via the interface system, instructions for controlling the display to make the presentation of displayed conference topics correspond with the selected conference recording time interval. For example, the playback control module 605 may select a different instance of a conversational dynamics data file (such as one of the conversational dynamics data files 515a-515e that are shown to be output by the conversational dynamics analysis module 510 in Figure 5) that most closely corresponds to the selected conference recording time interval chosen by the user and provide corresponding instructions to the display device 610.
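One reasonable interpretation of "most closely corresponds" is sketched below: choose the conversational dynamics data file whose time interval has the smallest total endpoint mismatch against the user-selected interval. The field names and the mismatch measure are assumptions.

```python
def closest_dynamics_file(dynamics_files, selected_start, selected_end):
    """Pick the conversational dynamics data file whose time interval most
    closely matches the user-selected recording time interval."""
    def mismatch(entry):
        return (abs(entry["start"] - selected_start)
                + abs(entry["end"] - selected_end))
    return min(dynamics_files, key=mismatch)

files = [{"name": "515a", "start": 0, "end": 3600},
         {"name": "515b", "start": 600, "end": 1200}]
print(closest_dynamics_file(files, 500, 1300)["name"])  # 515b
```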

[00303] If block 2605 involves receiving conference segment data, the display device 610 may be capable of controlling the display to present indications of one or more conference segments and to make the presentation of displayed conference topics indicate conference topics discussed in the one or more conference segments, e.g., as shown in Figure 25. The display device 610 may be capable of controlling the display to present waveforms corresponding to instances of conference participant speech and/or images corresponding to conference participants, such as those shown in Figures 24 and 25.

[00304] In the example shown in Figure 26, block 2615 involves receiving an indication of a selected topic chosen by a user from among the displayed conference topics. In some examples, block 2615 may involve receiving, by the playback control module 605 and via the interface system, user input from the display device 610. The user input may have been received via user interaction with a portion of the display corresponding to the selected topic, e.g., an indication from a touch sensor system of a user's touch in an area of a displayed word cloud corresponding to the selected topic. Another example is shown in Figure 27 and described below. In some implementations, if a user causes a cursor to hover over a particular word in a displayed word cloud, instances of conference participant speech associated with that word may be played back. In some implementations, the conference participant speech may be spatially rendered and/or played back in an overlapped fashion.

[00305] In the example shown in Figure 26, block 2620 involves selecting playback audio data comprising one or more instances of speech of the conference recording that include the selected topic. For example, block 2620 may involve selecting instances of speech corresponding to the selected topic, as well as at least some words spoken before and/or after the selected topic, in order to provide context. In some such examples, block 2620 may involve selecting utterances that include the selected topic.
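A minimal sketch of such a selection step, assuming per-utterance records with hypothetical field names, pads each matching utterance with a little audio before and after so the listener hears some context.

```python
def select_excerpts(utterances, selected_topic, context_seconds=2.0):
    """Select instances of speech whose recognized words include the
    selected topic, padding each excerpt before and after the utterance
    to provide context. The padding length is illustrative."""
    excerpts = []
    for utt in utterances:
        if selected_topic in utt["words"]:
            excerpts.append({
                "participant": utt["participant"],
                "start": max(0.0, utt["start"] - context_seconds),
                "end": utt["end"] + context_seconds,
            })
    return excerpts
```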

[00306] In some implementations, block 2620 may involve selecting at least two instances of speech, including at least one instance of speech uttered by each of at least two conference participants. The method may involve rendering the instances of speech to at least two different virtual conference participant positions of a virtual acoustic space to produce rendered playback audio data, or accessing portions of previously-rendered speech that include the selected topic. According to some implementations, the method may involve scheduling at least a portion of the instances of speech for simultaneous playback.

[00307] According to some implementations, block 2615 may involve receiving an indication of a selected conference participant chosen by a user from among the plurality of conference participants. One such example is shown in Figure 32 and described below. In some such implementations, block 2620 may involve selecting playback audio data that includes one or more instances of speech of the conference recording that include speech by the selected conference participant regarding the selected topic.

[00308] Here, block 2625 involves providing the playback audio data for playback on a speaker system. For example, the playback system 609 may provide mixed and rendered playback audio data, via the interface system, to the display device 610 in block 2625. Alternatively, the playback system 609 may provide the playback audio data directly to a speaker system, such as the headphones 607 and/or the speaker array 608, in block 2625.

[00309] Figure 27 shows an example of selecting a topic from a word cloud.

In some implementations, a display device 610 may provide the graphical user interface 606c on a display. In this example, a user has selected the word "pet" from the word cloud 2402 and has dragged a representation of the word to the search window 3105. In response, the display device may send an indication of the selected topic "pet" to the playback control module 605. Accordingly, this is an example of the "indication of a selected topic" that may be received in block 2615 of Figure 26. In response, the display device 610 may receive playback audio data corresponding to one or more instances of speech that involve the topic of pets.

[00310] Figure 28 shows an example of selecting both a topic from a word cloud and a conference participant from a list of conference participants. As noted above, a display device 610 may be providing the graphical user interface 606c on a display. In this example, after the user has selected the word "pet" from the word cloud 2402, the user has dragged a representation of the conference participant George Washington to the search window 3105. The display device 610 may send an indication of the selected topic "pet" and the conference participant George Washington to the playback control module 605. In response, the playback system 609 may send the display device 610 playback audio data corresponding to one or more instances of speech by the conference participant George Washington regarding the topic of pets.

[00311] When reviewing large numbers of teleconference recordings, or even a single recording of a long teleconference, it can be time-consuming to manually locate a part of a teleconference that one remembers. Some systems have been previously described by which a user may search for keywords in a speech recording by entering the text of a keyword that he or she wishes to locate. These keywords may be used for a search of text produced by a speech recognition system. A list of results may be presented to the user on a display screen.

[00312] Some implementations disclosed herein provide methods for presenting conference search results that may involve playing excerpts of the conference recording to the user very quickly, but in a way which is designed to allow the listener to attend to those results which interest him or her. Some such implementations may be tailored for memory augmentation. For example, some such implementations may allow a user to search for one or more features of a conference (or multiple conferences) that the user remembers. Some implementations may allow a user to review the search results very quickly to find one or more particular instances that the user is looking for.

[00313] Some such examples involve spatial rendering techniques, such as rendering the conference participant speech data for each of the conference participants to a separate virtual conference participant position. As described in detail elsewhere herein, some such techniques may allow the listener to hear a large amount of content quickly and then select portions of interest for more detailed and/or slower playback. Some implementations may involve introducing or changing overlap between instances of conference participant speech, e.g., according to a set of perceptually-motivated rules. Alternatively, or additionally, some implementations may involve speeding up the played-back conference participant speech. Accordingly, such implementations can make use of the human talent of selective attention to ensure that a desired search term is found, while minimizing the time that the search process takes.
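The sketch below illustrates one simple way excerpts might be scheduled for fast review: each excerpt is assigned the virtual position of its talker, starts slightly before the previous excerpt ends, and is played faster than real time. The overlap and speed values are illustrative placeholders, not the perceptually-motivated rules referred to above.

```python
def schedule_fast_playback(excerpts, positions, overlap_seconds=0.5,
                           speed_factor=1.5):
    """Schedule excerpts for rapid, spatially separated playback: each
    participant keeps a distinct virtual position, consecutive excerpts
    overlap slightly, and playback is sped up."""
    schedule, cursor = [], 0.0
    for excerpt in excerpts:
        duration = (excerpt["end"] - excerpt["start"]) / speed_factor
        start = max(0.0, cursor - overlap_seconds)
        schedule.append({
            "excerpt": excerpt,
            "playback_start": start,
            "playback_end": start + duration,
            "position": positions[excerpt["participant"]],
        })
        cursor = start + duration
    return schedule
```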

[00314] Accordingly, instead of returning a few results which are very likely to be relevant to the user's search terms and asking the user to individually audition each result (for example, by clicking on each result in a list, in turn, to play it), some such implementations may return many search results that the user can audition quickly (for example, in a few seconds) using spatial rendering and other fast playback techniques disclosed herein. Some implementations may provide a user interface that allows the user to further explore (for example, audition at 1:1 playback speed) selected instances of the search results. However, some examples disclosed herein may or may not involve spatial rendering, introducing or changing overlap between instances of conference participant speech or speeding up the played-back conference participant speech, depending on the particular implementation. Moreover, some disclosed implementations may involve searching other features of one or more conferences in addition to, or instead of, the content. For example, in addition to searching for particular words in one or more teleconferences, some implementations may involve performing a concurrent search for multiple features of a conference recording. In some examples, the features may include the emotional state of the speaker, the identity of the speaker, the type of conversational dynamics occurring at the time of an utterance (e.g., a presentation, a discussion, a question and answer session, etc.), an endpoint location, an endpoint type and/or other features. A concurrent search involving multiple features (which may sometimes be referred to herein as a multi-dimensional search) can increase search accuracy and efficiency. For example, if a user could only perform a keyword search, e.g., for the word "sales" in a conference, the user might have to listen to many results before finding a particular excerpt of interest that the user may remember from the conference. In contrast, if the user were to perform a multi-dimensional search for instances of the word "sales" spoken by the conference participant Fred Jones, the user could have potentially reduced the number of results that the user would need to review before finding an excerpt of interest.
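A minimal sketch of such a multi-dimensional filter is shown below. The index entry fields and criteria names are hypothetical; the point is simply that each additional search dimension narrows the result set.

```python
def multi_dimensional_search(index_entries, **criteria):
    """Filter a conference search index by several features at once, e.g.,
    keyword, speaker identity, emotional state, conversational dynamics
    or endpoint type."""
    results = []
    for entry in index_entries:
        if all(entry.get(key) == value for key, value in criteria.items()):
            results.append(entry)
    return results

# e.g. instances of "sales" spoken by Fred Jones during a presentation:
# multi_dimensional_search(index, word="sales", speaker="Fred Jones",
#                          dynamics="presentation")
```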

[00315] Accordingly, some disclosed implementations provide methods and devices for efficiently specifying multi-dimensional search terms for one or more teleconference recordings and for efficiently reviewing the search results to locate particular excerpts of interest. Various modifications to the implementations described in this disclosure may be readily apparent to those having ordinary skill in the art. The general principles defined herein may be applied to other implementations without departing from the scope of this disclosure. Thus, the claims are not intended to be limited to the implementations shown herein, but are to be accorded the widest scope consistent with this disclosure, the principles and the novel features disclosed herein.

[00316] Various features and aspects will be appreciated from the following enumerated example embodiments ("EEEs"):

EEE1. An apparatus, comprising:

an interface system; and

a control system capable of:

receiving, via the interface system, speech recognition results data for at least a portion of an audio recording;

receiving an initial search query including at least one search word;

analyzing the initial search query according to phonetic similarity and semantic similarity;

determining an expanded search query according to the phonetic similarity, the semantic similarity, or both the phonetic similarity and the semantic similarity; and

performing a search of the speech recognition results data according to the expanded search query.

EEE2. The apparatus of claim EEE1, wherein the audio recording includes at least a portion of a recording of a conference involving a plurality of conference participants.

EEE3. The apparatus of claim EEE2, wherein the speech recognition results data includes a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices, the word recognition confidence score corresponding with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference.

EEE4. The apparatus of any one of claims EEE1-EEE3, wherein analyzing the initial search query involves analyzing syllables and phonemes of the initial search query.

EEE5. The apparatus of any one of claims EEE1-EEE4, wherein determining the expanded search query involves a selecting process of determining candidate search query terms and selecting candidate search query terms to produce a refined search term list.

EEE6. An apparatus, comprising:

an interface system; and

control means for:

receiving, via the interface system, speech recognition results data for at least a portion of an audio recording;

receiving an initial search query including at least one search word;

analyzing the initial search query according to phonetic similarity and semantic similarity;

determining an expanded search query according to the phonetic similarity, the semantic similarity, or both the phonetic similarity and the semantic similarity; and

performing a search of the speech recognition results data according to the expanded search query.

EEE7. The apparatus of claim EEE6, wherein the audio recording includes at least a portion of a recording of a conference involving a plurality of conference participants.

EEE8. The apparatus of claim EEE7, wherein the speech recognition results data includes a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices, the word recognition confidence score corresponding with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference.

EEE9. The apparatus of any one of claims EEE6-EEE8, wherein analyzing the initial search query involves analyzing syllables and phonemes of the initial search query.

EEE10. The apparatus of any one of claims EEE6-EEE9, wherein determining the expanded search query involves a selecting process of determining candidate search query terms and selecting candidate search query terms to produce a refined search term list.

EEE11. A non-transitory medium having software stored thereon, the software including instructions for controlling one or more devices for:

receiving speech recognition results data for at least a portion of an audio recording;

receiving an initial search query including at least one search word;

analyzing the initial search query according to phonetic similarity and semantic similarity;

determining an expanded search query according to the phonetic similarity, the semantic similarity, or both the phonetic similarity and the semantic similarity; and

performing a search of the speech recognition results data according to the expanded search query.

EEE12. The non-transitory medium of claim EEE11, wherein the audio recording includes at least a portion of a recording of a conference involving a plurality of conference participants.

EEE13. The non-transitory medium of claim EEE12, wherein the speech recognition results data includes a plurality of speech recognition lattices and a word recognition confidence score for each of a plurality of hypothesized words of the speech recognition lattices, the word recognition confidence score corresponding with a likelihood of a hypothesized word correctly corresponding with an actual word spoken by a conference participant during the conference.

EEE14. The non-transitory medium of any one of claims EEE11-EEE13, wherein analyzing the initial search query involves analyzing syllables and phonemes of the initial search query.

EEE15. The non-transitory medium of any one of claims EEE11-EEE14, wherein determining the expanded search query involves a selecting process of determining candidate search query terms and selecting candidate search query terms to produce a refined search term list.