

Title:
MULTIMODAL INPUT-BASED DATA SELECTION AND COMMAND EXECUTION
Document Type and Number:
WIPO Patent Application WO/2022/104297
Kind Code:
A1
Abstract:
Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for using user inputs received via multiple input modalities/modes to select data (e.g., text) from documents and execute commands/instructions to perform operations based on the selected data. In one aspect, first and second user inputs that are made using first and second input modalities, respectively, can be detected during display of a document on a computing device. A set of candidate text items can be generated based on the first user input, and based on the second user input, a particular text item can be selected from the set of candidate text items. An instruction, which specifies the particular text item, can be generated for execution on the computing device. The computing device can then be instructed to execute the generated instruction.

Inventors:
MILOTA ANDRÉ (US)
TONG WENHAN (US)
Application Number:
PCT/US2021/065792
Publication Date:
May 19, 2022
Filing Date:
December 30, 2021
Assignee:
FUTUREWEI TECHNOLOGIES INC (US)
International Classes:
G06F3/033; G06F40/00; G10L15/26
Foreign References:
US20160117146A1 (2016-04-28)
Attorney, Agent or Firm:
JHURANI, Karan (US)
Claims:
CLAIMS

1. A computer-implemented method, comprising: detecting, during display of a document on a computing device, a first user input that is made using a first input modality and a second user input that is made using a second input modality that is different from the first input modality; generating, based on the first user input, a set of candidate text items; selecting, based at least on the second user input, a particular text item from among the set of candidate text items; generating an instruction for execution on the computing device, wherein the instruction specifies the particular text item; and instructing the computing device to execute the instruction.

2. The computer-implemented method of claim 1, wherein the first input modality is a spatial input modality and the second input modality is a linguistic input modality.

3. The computer-implemented method of claim 2, wherein: the first input includes one of: a touch input, a stylus-based input, or a contactless gesture-based input, wherein the contactless gesture-based input includes input based on eye-movement tracking, foot position tracking, or hand position tracking; and the second input includes one of: an audio input, a keyboard input, or input received via myoelectric sensors.

4. The computer-implemented method of any of claims 1-3, wherein detecting, during display of the document on the computing device, the first user input that is made using the first input modality and the second user input that is made using the second input modality that is different from the first input modality, comprises: detecting a touch input on the display of the computing device and within an area of text within the document, wherein the display of the computing device is a touch screen; and detecting speech in audio input received via a microphone of the computing device.


5. The computer-implemented method of any preceding claim, wherein selecting, based at least on the second user input, the particular text item from among the set of candidate text items, comprises: parsing the second input into a textual transcription; selecting, using the parsed textual transcription, a text analysis algorithm from among a plurality of text analysis algorithms; and using the selected text analysis algorithm to identify a second set of candidate text items from text included in the document.

6. The computer-implemented method of claim 5, wherein the selected text analysis algorithm is a date identifying algorithm, a time identifying algorithm, or a contact information identifying algorithm.

7. The computer-implemented method of claim 5 or 6, wherein the set of candidate text items is a first set of candidate text items, and wherein selecting, based at least on the second user input, the particular candidate text item from among the set of candidate text items, comprises: correlating the first set of candidate text items with the second set of candidate text items; determining a score for each candidate text item in the second set of candidate text items based on a degree of correlation of the candidate text item with a candidate text item in the first set of candidate text items; and selecting, based on the scores of the candidate text items in the second set, the particular candidate text item.

8. The computer-implemented method of any preceding claim, wherein the first user input is a touch input and the computing device includes a touch screen, and wherein generating, based on the first user input, the set of candidate text items, comprises: determining, based on the touch input, an area of text included in the document displayed on the computing device; generating, based on text included in the area of text, a plurality of text items and a corresponding plurality of scores, wherein a score for a given text item indicates a likelihood that the given text item was intended to be selected by the touch input; and selecting, based on the scores for the plurality of text items, the set of candidate text items.

9. The computer-implemented method of claim 8, wherein determining, based on the touch input, an area of text included in the document displayed on the computing device, comprises: identifying a contacted area that represents an area of the touch screen corresponding to the touch input; and determining the area of text included in the document that corresponds to the contacted area.

10. The computer-implemented method of claim 9, wherein generating, based on text included in the area of text, the plurality of text items and the corresponding plurality of scores, comprises: identifying the plurality of text items in the area of text included in the document, comprising: detecting a gesture corresponding to the touch input; and generating, using a set of rules corresponding to the detected gesture, the plurality of text items in the area of text included in the document.

11. The computer-implemented method of any preceding claim, wherein the touch input corresponds to one of: an underlining gesture, a circling gesture, or a one-tap touch gesture.

12. The computer-implemented method of any of claims 9-11, wherein generating, based on text included in the area of text, the plurality of text items and the corresponding plurality of scores, comprises: determining, for each text item in the plurality of text items, a proportional value representing a proportion of the text item that is included in the contacted area, wherein the generated score for a particular text item is based on the proportional value determined for the particular text item.

13. The computer-implemented method of any of claims 8-12, wherein selecting, based on the scores for the plurality of text items, the set of candidate text items, comprises: selecting, from among the plurality of text items, one or more text items that each has a score satisfying a particular threshold, wherein the set of candidate text items consists of the one or more text items.

14. The computer-implemented method of any preceding claim, wherein generating, based on the first and second user inputs, the instruction to execute on the computing device, comprises: parsing the second input into a textual transcription; and generating the instruction using one or more words included in the textual transcription and the particular text item.

15. The computer-implemented method of any preceding claim, wherein the computing device is a mobile device.

16. The computer-implemented method of any preceding claim, wherein instructing the computing device to execute the instruction, includes: instructing an application on the computing device to execute the instruction.

17. The computer-implemented method of claim 16, wherein the application that is instructed to execute the instruction is one of: an application other than an application within which the document is being provided for display on the computing device; or the application within which the document is being provided for display on the computing device.

18. The computer-implemented method of any preceding claim, wherein the instruction is an instruction to edit text in the document.

19. A system, comprising: at least one processor; and one or more memories coupled to the at least one processor and storing programming instructions for execution by the at least one processor to perform the method of any of the preceding claims.

20. A non-transitory computer readable medium storing instructions for causing a computer system to perform the method of any of claims 1-18.


Description:
MULTIMODAL INPUT-BASED DATA SELECTION AND COMMAND EXECUTION

TECHNICAL FIELD

[0001] The present disclosure generally relates to data processing and techniques for using user inputs received via multiple input modalities/modes to select data (e.g., text) from documents and execute commands/instructions to perform operations based on the selected data.

BACKGROUND

[0002] A user typically interacts with computing devices, such as mobile devices that have reduced form factors (relative to desktop computers), using touch or another contact-based mechanism (e.g., a stylus) or, alternatively, a contactless interaction mechanism (e.g., gestures or eye gaze, which can be discerned by a camera or other sensor(s)). When using such a computing device, a user generally can use the contact or contactless interaction mechanisms to identify and select certain text or other objects (e.g., images) provided within user interfaces or other documents or pages displayed on the computing device. The user can then either manually initiate performance of a task using the identified data (e.g., by launching a menu and selecting an option to save the image) or be prompted by the computing device to perform a particular task, which is then performed when the user selects the provided prompt (e.g., a prompt to copy and/or paste the selected text in the form of software buttons displayed on the device).

[0003] Contact-based techniques, such as finger touch, for identifying and selecting text or other objects on computing devices, particularly on reduced form factor devices (e.g., mobile devices with small displays), are often imprecise given the disparity between the small text (or other objects) provided on the display of these devices and the relatively large contact area corresponding to the area of the display touched by a user’s finger (this is often referred to as the “fat finger problem”).

[0004] The “fat finger problem” is illustrated using the following example, where a user uses his/her finger to interact with a web page displayed on a touch-sensitive display of a mobile device. In this example, a user selects a few words of the text presented on the web page by dragging a finger across the device display (and over the intended text). However, when the user’s finger contacts the display, the size of the contacted area is generally larger than the font of the text, and as a result, the computing device (e.g., the display drivers and other software entities that detect the touch input and attempt to determine the selected text) is either unable to determine the precise text that the user intended to select or erroneously identifies text other than the text that the user intended to select. Similar issues arise with other contact-based or contactless interaction mechanisms, which suffer from the same imprecision in text or object identification and selection.

SUMMARY

[0005] In general, one innovative aspect of the subject matter described in this specification can be embodied in methods including the operations of detecting, during display of a document on a computing device, a first user input that is made using a first input modality and a second user input that is made using a second input modality that is different from the first input modality; generating, based on the first user input, a set of candidate text items; selecting, based at least on the second user input, a particular text item from among the set of candidate text items; generating an instruction for execution on the computing device, wherein the instruction specifies the particular text item; and instructing the computing device to execute the instruction.

[0006] Other embodiments of this aspect include corresponding methods, apparatus, and computer programs, configured to perform the actions of the methods, encoded on computer storage devices. These and other embodiments can each optionally include one or more of the following features.

[0007] In some implementations, the first input modality is a spatial input modality and the second input modality is a linguistic input modality.

[0008] In some implementations, the first input can include one of: a touch input, a stylus-based input, or a contactless gesture-based input, wherein the contactless gesture-based input includes input based on eye-movement tracking, foot position tracking, or hand position tracking.

[0009] In some implementations, the second input can include one of: an audio input, a keyboard input, or input received via myoelectric sensors.

[0010] In some implementations, detecting, during display of the document on the computing device, the first user input that is made using the first input modality and the second user input that is made using the second input modality that is different from the first input modality, can include detecting a touch input on the display of the computing device and within an area of text within the document, wherein the display of the computing device is a touch screen; and detecting speech in audio input received via a microphone of the computing device.

[0011] In some implementations, selecting, based at least on the second user input, the particular text item from among the set of candidate text items, can include parsing the second input into a textual transcription; selecting, using the parsed textual transcription, a text analysis algorithm from among a plurality of text analysis algorithms; and using the selected text analysis algorithm to identify a second set of candidate text items from text included in the document.

[0012] In some implementations, the selected text analysis algorithm is a date identifying algorithm, a time identifying algorithm, or a contact information identifying algorithm.

[0013] In some implementations, the set of candidate text items can be a first set of candidate text items, and selecting, based at least on the second user input, the particular candidate text item from among the set of candidate text items can include correlating the first set of candidate text items with the second set of candidate text items; determining a score for each candidate text item in the second set of candidate text items based on a degree of correlation of the candidate text item with a candidate text item in the first set of candidate text items; and selecting, based on the scores of the candidate text items in the second set, the particular candidate text item.

[0014] In some implementations, the first user input can be a touch input and the computing device can include a touch screen.

[0015] In some implementations, generating, based on the first user input, the set of candidate text items, can include determining, based on the touch input, an area of text included in the document displayed on the computing device; generating, based on text included in the area of text, a plurality of text items and a corresponding plurality of scores, wherein a score for a given text item indicates a likelihood that the given text item was intended to be selected by the touch input; and selecting, based on the scores for the plurality of text items, the set of candidate text items.

[0016] In some implementations, determining, based on the touch input, an area of text included in the document displayed on the computing device, can include identifying a contacted area that represents an area of the touch screen corresponding to the touch input; and determining the area of text included in the document that corresponds to the contacted area.

[0017] In some implementations, generating, based on text included in the area of text, the plurality of text items and the corresponding plurality of scores, can include identifying the plurality of text items in the area of text included in the document, which can further include detecting a gesture corresponding to the touch input; and generating, using a set of rules corresponding to the detected gesture, the plurality of text items in the area of text included in the document.

[0018] In some implementations, the touch input corresponds to one of: an underlining gesture, a circling gesture, or a one-tap touch gesture.

[0019] In some implementations, generating, based on text included in the area of text, the plurality of text items and the corresponding plurality of scores, can include determining, for each text item in the plurality of text items, a proportional value representing a proportion of the text item that is included in the contacted area, wherein the generated score for a particular text item is based on the proportional value determined for the particular text item.

[0020] In some implementations, selecting, based on the scores for the plurality of text items, the set of candidate text items, can include selecting, from among the plurality of text items, one or more text items that each has a score satisfying a particular threshold, wherein the set of candidate text items consists of the one or more text items.

[0021] In some implementations, generating, based on the first and second user inputs, the instruction to execute on the computing device, can include parsing the second input into a textual transcription; and generating the instruction using one or more words included in the textual transcription and the particular text item.

[0022] In some implementations, the computing device can be a mobile device.

[0023] In some implementations, instructing the computing device to execute the instruction, can include instructing an application on the computing device to execute the instruction.

[0024] In some implementations, the application that is instructed to execute the instruction can be one of: an application other than an application within which the document is being provided for display on the computing device; or the application within which the document is being provided for display on the computing device.

[0025] The subject matter described in this specification can be implemented in particular embodiments to realize one or more of the following advantages. For example, the techniques described in this specification enable precise and reliable data selection from a digital document displayed on a reduced form factor device. As described above, software services that only utilize touch or gesture-based selection techniques are generally unable to determine the precise text or other object(s) selected by the user using touch/gestures. The techniques described herein enable a more precise determination (relative to the single input gesture-based techniques) of the text or other object(s) intended to be selected by using inputs from two or more input modalities (e.g., a first input based on touch, gestures, etc. and a second input based on verbal commands spoken by a user or written commands typed by a user).

[0026] As used in this specification, an input modality (also referred to herein as “input mode” or simply as “mode” or “modality”) refers to a communication channel or medium via which a user interacts or communicates with a computing device. Examples of input modalities include, among others, a spatial modality and a linguistic modality. A spatial modality specifies a communication channel or medium via which a user uses spatial movements to interact with a computing device, such as, e.g., user interactions provided using a finger touch and/or gestures that can be detected by touch-sensitive displays and/or other sensors (e.g., a camera’s image sensor), respectively, and the associated device services (e.g., device drivers), as applicable. Additional examples of spatial modalities include contact-based gesture input entered using a non-finger contact device (e.g., a stylus) and processed by device services (e.g., a display driver), and contactless gestures (e.g., eye/hand/feet/head movements) detected by a device sensor (e.g., a camera, motion sensor) and processed by associated device services (e.g., image processing engines of the device). A linguistic modality specifies a communication channel or medium via which a user interacts with a computing device using words, such as, e.g., words typed on a keyboard and detected and processed by a keyboard driver, or oral/spoken utterances of words that are detected and processed, e.g., by a microphone and associated audio drivers. Other examples of linguistic modalities include words detected by a device that are communicated/entered using a keypad or another keyboard alternative (e.g., keyboard-alternative pedals), and words detected and processed by myoelectric sensors and associated device drivers.

[0027] In addition to enabling precise and reliable data selection from a digital document displayed on a reduced form factor device, the techniques described herein can also be used to precisely select text from documents that include unstructured text and which otherwise do not provide any metadata or other indicators to aid in the identification of text that the user intended to select.

[0028] Additionally, by enabling a more precise identification of the text/data that the user intended to select, the techniques described herein can be resource and time efficient in that they can reduce the need for additional interactions (and the underlying resources utilized for those interactions) that otherwise stem from a user having to repeatedly specify the text/data that the user intended to select.

[0029] Further still, the techniques described herein improve the usability of reduced form factor devices (e.g., smartphones or other devices with limited screen real estate) and improve user productivity when using such devices, especially in usage scenarios (e.g., text or object editing) where such devices fall short because of their limited screen size (as described above). Nevertheless, these techniques can also be used with larger form factor devices when desirable.

[0030] The techniques described herein also enable a more natural human-computer interaction by enabling performance of instructions in a manner akin to how humans normally communicate with each other. For example, these techniques enable use of gestures and spoken commands that reference text and/or other objects in documents in terms of their semantic meaning. Based on these gestures and commands, the techniques described herein can enable selection of text and performance of certain operations specified by the gesture and/or spoken command.

[0031] Relatedly, by utilizing multiple input modalities, the techniques described herein enable precise text (or other data) selection even if less precise inputs are received from any or all of the input modalities. For example, the techniques described herein can accept less precise gestures for text selection (compared to conventional text selection techniques) and still identify and select the text that the user intended to select with higher precision than conventional gesture-based text selection techniques. Similarly, the techniques described herein can accept less comprehensive audio inputs (e.g., an oral input including words orally uttered by a user) and still identify and select the text that the user intended to select with higher precision than conventional speech-based selection techniques (which, e.g., generally require the user to read aloud the entire text that the user intends to select). In some implementations, the techniques described herein enable a user to provide an imprecise gesture (e.g., a touch on a portion of text displayed on the device, where the touched area encompasses multiple words) and an imprecise audio command (“select this time”), such that the two inputs can be processed to identify and select the text that the user intended to select with higher precision than conventional text selection techniques.

[0032] The details of one or more embodiments of the subject matter of this specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

BRIEF DESCRIPTION OF THE FIGURES

[0033] FIG. 1 is a block diagram of an example computing environment including user devices communicating with web servers over a network and performing multiple computing tasks.

[0034] FIGS. 2A-2D show operational and structural details of an example digital assistant that uses inputs from multiple input modalities to select text from a digital document and perform operations based on the selected text.

[0035] FIG. 3 is a flow diagram of an example process for using inputs from multiple input modalities to select text from a digital document and perform operations based on the selected text.

[0036] FIG. 4 is a schematic diagram of a general-purpose network component or computer system.

[0037] Like reference numbers and designations in the various drawings indicate like elements.

DETAILED DESCRIPTION

[0038] FIG. 1 is a block diagram of an example computing environment including user devices communicating with web servers over a network and performing multiple computing tasks.

[0039] As shown, the example environment 100 includes a network 104, which can include a local area network (LAN), a wide area network (WAN), the Internet, or a combination thereof. The network 104 can also comprise any type of wired and/or wireless network, satellite networks, cable networks, Wi-Fi networks, mobile communications networks (e.g., 3G, 4G, and so forth), or any combination thereof. The network 104 can utilize communications protocols, including packet-based and/or datagram-based protocols such as internet protocol (IP), transmission control protocol (TCP), user datagram protocol (UDP), or other types of protocols. The network 104 can further include a number of devices that facilitate network communications and/or form a hardware basis for the networks, such as switches, routers, gateways, access points, firewalls, base stations, repeaters, or a combination thereof. The network 104 connects client devices 102 and web servers 106. The example environment 100 may include many different web servers 106 and user devices 102.

[0040] A web server 106 is a computing platform that can send content to and receive content from, e.g., one or more user devices 102, over the network 104. Example web servers 106 include search engines, social media servers, news servers, email servers, or any other content platform. For example, an email server can communicate over network 104 with a user device 102 to provide data about received emails to the user device. As another example, a social media server can communicate over network 104 with a user device 102 to provide data about social media posts for a particular user’s social media account accessed by a user device 102.

[0041] A user device 102 (also referred to herein as computing device 102 or simply as device 102) is an electronic device that is capable of requesting and receiving content over the network 104, as well as performing various tasks directly on the user device 102 (e.g., document drafting, playing games, etc.). Example user devices 102 include personal computers, mobile communication devices, mobile phones, tablets, laptops, smartwatches, digital assistant devices, and other devices that can send and receive data over the network 104.

[0042] An example user device 102A typically comprises an operating system 122 that is generally responsible for managing the device hardware, such as device storage 124, and software resources, such as applications 114. The user device 102A can include various types of applications. Some examples of such applications (and as shown in FIG. 1) include a web browser 128, mail 116, notes 118, and messages 126. The web browser 128 facilitates the receiving and displaying of web pages, such as those provided by a web server 106 over a network 104. The mail application 116 enables a user to send and receive emails over a network 104 by communicating with a mail server (another type of web server 106). The notes application 118 can be any word editing software that allows a user of user device 102 to perform various word editing tasks, such as, e.g., drafting notes and copying and pasting text, images, or other objects. The messages application 126 enables a user to send and receive text messages or instant messages with one or more contacts.

[0043] As shown in FIG. 1, the example user device 102A also includes a digital assistant 120, which can assist a user of the user device 102A to perform various tasks in response to commands that are generally issued verbally by the user. For example, digital assistants commonly provide oral and/or written answers from web resources based on a user’s verbal questions (e.g., “what’s the weather?”, “where is Buckingham Palace?”) as well as assist in performing daily tasks (e.g., setting reminders and generating calendar events, in response to receiving details for such reminders and calendar events in the user’s verbal command). In some implementations, the digital assistant 120 can be triggered or launched, e.g., upon pressing a button on the user device 102 or upon a user saying a particular catch phrase. Upon being launched, the digital assistant 120 provides an interface that connects with, e.g., a microphone on the user device 102, to detect a voice command issued by the user, parses the command to identify the task requested in the command, performs the identified task, and provides a response to the user, such as requesting additional details from the user (e.g., “what time should I set the appointment for?”), providing confirmation that the task is being performed or has been performed (e.g., “a timer for 5 minutes has been set,” “reminder has been set for April 11”), or providing information in response to the user’s request (e.g., “here is information I found on the Web for your question,” followed by a list of search results).

[0044] As described further with reference to FIGS. 2A-2D and 3, the digital assistant 120 can be further configured to utilize inputs received via multiple input modalities — e.g., commands received via a linguistic input channel (e.g., an audio/verbal input, text entered via keyboard/keypad or a keyboard alternative, myoelectric sensors, keyboard-alternative pedals, or other alternative inputs) and user gestures received via a spatial input channel (e.g., a touch input, an input using a device such as a stylus, or a contactless gesture captured by a device sensor (e.g., a camera, motion sensor)) — to select the text or other data from a data source (e.g., a web page displayed in the browser 128, a page in the notes application 118) and perform an operation based on the selected data.

[0045] FIGS. 2A-2D show operational and structural details of an example digital assistant that uses inputs received using multiple input modalities to select text from a digital document and perform operations based on the selected text. FIGS. 2A and 2B depict example scenarios that illustrate the high-level functionality of such a digital assistant, and FIG. 2C depicts the structural aspects of such a digital assistant that enable the functionality shown in FIGS. 2A and 2B. FIG. 2D depicts an additional example scenario in which an example digital assistant uses inputs received using multiple input modalities to select text from a digital document and performs an operation that modifies the digital document based on the user’s command and the selected text.

[0046] In the following description, a digital assistant (e.g., digital assistant 120) is described as being configured to utilize user input in multiple input modes to select certain text from a document and to perform operations based on the selected text. However, one skilled in the art will understand that the structural components of the digital assistant 120, as shown in FIG. 2C, can also be implemented as a standalone software package. In some implementations, such a software package could be executed by a computing device, e.g., as a standalone application (e.g., application 114). Alternatively, or additionally, such a software package can be implemented as part of an existing application (e.g., a mail application or a document editor application) that provides a multimodal user interface, which is an interface that accepts multiple modes of user input and uses these inputs to perform operations requested by the user input(s). Similarly, the components can also be implemented as separate applications, and even further, on separate devices in communication with each other.

[0047] FIGS. 2A and 2B are described first below to illustrate, using examples, the high-level functionality of the digital assistant 120, followed by a description of the structural aspects of the digital assistant in FIG. 2C, which enable the example functionality described with reference to FIGS. 2A and 2B.

[0048] In FIGS. 2A and 2B, a user of a user device 102 is shown as viewing an email in an email application (e.g., mail application 116). As shown, the email being displayed on the user device 102 was sent from Joe to Jennifer (as indicated by the “From” and “To” fields, respectively), and includes the subject “Lunch” (as indicated by the “Subject” field). Also, as shown in FIGS. 2A and 2B, the body of the email reads: “Want to meet for lunch? I was thinking we could go to that Johnny’s Bar at noon. Btw, send me the name of that show that you were watching yesterday. See you soon.”

[0049] One skilled in the art will appreciate that the body of the email shown in FIGS. 2A and 2B includes unstructured text (as opposed to semi-structured text or structured text). Semi-structured text or structured text generally includes metadata, labels, tags, or other identifiers that allow portions of such text to be identified or categorized into, e.g., semantically labeled fields. In contrast, unstructured text does not have any tags or other identifiers that structure portions of text, e.g., into semantically labeled fields; rather, such unstructured text is viewed by a computing system as a homogeneous vector of characters. For example, in FIGS. 2A and 2B, the text “Joe”, “Jennifer”, and “Lunch” constitute structured or semi-structured text because they are identified and organized using semantically labeled fields, such as “From”, “To”, and “Subject”, respectively. In contrast, the body of the email shown in FIGS. 2A and 2B (i.e., the portion shown after the labeled fields) includes unstructured text given that this text does not include any metadata, labels, or identifiers that allow the body text to be identified or categorized into, e.g., semantically labeled fields.

[0050] In both FIGS. 2A and 2B, the user is shown using his finger 220 to select a portion of the text in the body of the email. As shown, the contact area corresponding to the user’s finger encompasses text in multiple lines. As such, it is unclear from the contacted area whether the user intends to select, e.g., a position between any of the letters towards the middle of each of these lines, one of the letters found in this region, one of the words in the vicinity of the touch, a phrase or a sentence that has some characters in this region, an entire paragraph, or the entire email.

[0051] In addition to the touch input, the user also issues a verbal command (as shown by item 210 in both FIGS. 2A and 2B). For example, as shown in FIG. 2A, the user says, “remind me to send this to Joe” while pointing to the text, which, as described above, could be ambiguously pointing to one or more words. Given the command to “remind ... to send” Joe something that is being gestured at, and given the context of the email (where Joe is asking Jennifer to send the name of the show), the digital assistant 120 determines that the user intended to select the text: “name of the show that you were watching.”

[0052] Alternatively, as shown in FIG. 2B, the user can say, “remind me to meet Joe for lunch at this time.” Again, the user can say this command while pointing to the text, which, as described above, could be ambiguously pointing to one or more words in the email. Given the command to “remind ... to meet Joe for lunch at” a time that is being gestured at by the user’s touch, and given the context of the email (where Joe is asking Jennifer to meet Joe for lunch at noon), the digital assistant 120 determines that the user intended to select the word “noon.”

[0053] Additionally, the user’s verbal input is used by the digital assistant 120 to determine the command, which in both FIGS. 2A and 2B is to generate a reminder. The digital assistant 120 then creates a reminder (e.g., using the Reminder application loaded on the user device 102) to “send the name of that show to Joe” (in the FIG. 2A scenario) or “to meet Joe for lunch at 12:00 pm” (in the FIG. 2B scenario). As described with reference to FIG. 1, the digital assistant 120 can perform any number of computing operations (e.g., creating calendar events, performing word editing functions, making phone calls, sending text messages, etc.) and is not limited to creating reminders (which is simply used for illustrative purposes in FIGS. 2A and 2B).

[0054] As illustrated by the above descriptions of FIGS. 2A and 2B, the user’s touch-based input did not precisely identify the text that the user intended to select. However, in both the FIGS. 2A and 2B scenarios, the user’s verbal input supplemented the user’s touch-based input, and the two inputs collectively enabled the digital assistant 120 to select the precise text that the user intended to select and to use this text in performing the operation(s) instructed by the user. The structure and operations of the digital assistant 120 that enable these (and additional) functions are described with reference to FIG. 2C.

[0055] As shown in FIG. 2C, the digital assistant 120 includes six components: speech recognizer 230, utterance interpreter 280, gesture analyzer 240, text analyzer 250, command identifier 260, and command executor 270. Each of these components includes a set of programmatic instructions (written in any suitable programming language) that can be executed by a device processor or another data processing apparatus (which is further described below) to enable the performance of their respective operations that are described below. Although shown as six different components, one skilled in the art will appreciate that fewer or more components can be used to enable the device’s performance of the below-described operations. For example, the utterance interpreter 280 and the command identifier 260 can be implemented as a single component that performs processing of the textual transcription of the spoken utterance to both determine a configuration (e.g., a text analysis algorithm) that can be used for text analysis performed by the text analyzer 250 and to identify a command that the user wants the computing device 102 to perform. For ease of reference and brevity, the following description of the operation of these components is based on the implementation of these components as part of the digital assistant 120, as shown in FIG. 2C.
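To make the division of labor among these six components concrete, the following Python sketch shows one possible way to wire them together, consistent with the data flow summarized in the next paragraph. All class and method names here are illustrative assumptions for exposition only; the patent does not prescribe a particular implementation.

class MultimodalSelectionPipeline:
    """Illustrative wiring of the six components of the digital assistant 120."""

    def __init__(self, speech_recognizer, utterance_interpreter, gesture_analyzer,
                 text_analyzer, command_identifier, command_executor):
        self.speech_recognizer = speech_recognizer          # audio -> textual transcription
        self.utterance_interpreter = utterance_interpreter  # transcription -> text analysis algorithm
        self.gesture_analyzer = gesture_analyzer            # touch event -> contacted document area
        self.text_analyzer = text_analyzer                  # document text + area/algorithm -> candidate sets
        self.command_identifier = command_identifier        # transcription + selected text -> command
        self.command_executor = command_executor            # command -> executed operation

    def handle(self, touch_event, audio, document_text):
        # Spatial path: touch -> contacted area -> first set of candidate text items.
        contacted_area = self.gesture_analyzer.contacted_area(touch_event)
        first_set = self.text_analyzer.candidates_from_area(document_text, contacted_area)

        # Linguistic path: audio -> transcription -> text analysis algorithm -> second set.
        transcription = self.speech_recognizer.transcribe(audio)
        algorithm = self.utterance_interpreter.select_algorithm(transcription)
        second_set = self.text_analyzer.candidates_from_algorithm(document_text, algorithm)

        # Correlate the two candidate sets to pick the text the user most likely intended.
        selected_text = self.text_analyzer.correlate(first_set, second_set)

        # Build and execute a command that references the selected text.
        command = self.command_identifier.build(transcription, selected_text)
        self.command_executor.execute(command)
        return command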

[0056] The operation of these components is summarized here and described in greater detail in the subsequent paragraphs. A user of a computing device that is displaying a document can gesture, e.g., by touching the device display, and speak a command. The gesture analyzer 240 can identify a portion of the document indicated by the user’s gesture (e.g., the user's touch), and the text analyzer 250 can generate a first set of candidate text items based on text included in the area indicated by the user’s gesture. The user’s spoken utterance can be transcribed by the speech recognizer 230 into a textual transcript (a textual sequence of words) and sent to the utterance interpreter 280. The utterance interpreter 280 processes this textual transcript to determine a text analysis algorithm, which is used by the text analyzer 250 to search for particular types of text within the text of the document (e.g., the entirety of the text or a portion thereof) displayed on the device and identify, based on such search, a second set of candidate text items. The first and second sets of candidate text items can then be correlated to identify a particular text item that the user most likely intended to select, which is then provided to the command identifier 260. The command identifier 260 identifies and generates a command based on the textual transcription of the user’s spoken utterance and the selected particular text item. The command executor 270 then executes the generated command.

[0057] Spatial Input Detection and Analysis

[0058] As shown in FIG. 2C, a user identifies text displayed in a document/page using a spatial input mode, which in the case of FIG. 2C (as well as in FIGS. 2A and 2B) is a touch input. However, the user’s spatial input is not limited to touch and can include any number of spatial input types, such as, e.g., an input made using a device such as a stylus or a joystick, or a contactless gesture-based input (e.g., contactless gestures such as eye, foot, and/or hand movements that are captured and parsed by one or more device sensors (e.g., a camera, motion sensor) and corresponding device drivers). There is an even wider variety of spatial input devices available, including pens, mice, hand tracking systems, joysticks, eye trackers, and brain-computer interfaces. For brevity and ease of description, the following description of FIG. 2C is based on finger touch-based input.

[0059] When the user touches the screen, the touch is detected (e.g., by a touch sensor and a corresponding device driver) and is provided to a gesture analyzer 240, which uses the contacted area corresponding to the detected touch to identify the portion of the page/document that was contacted/touched by the user’s finger. In some implementations, the gesture analyzer 240 provides data identifying the portion of the page/document that was contacted/touched by the user’s finger (which can be, e.g., data specifying coordinates of the device screen or coordinates of the identified portion with reference to the underlying page/document) to the text analyzer 250.

[0060] The text analyzer 250 uses the received data identifying the portion of the page/document that was contacted/touched by the user’s finger and processes the text included in this identified portion of the page/document to generate a set of hypotheses for the text (also referred to herein as candidate text items) that the user’s touch/gesture intended to select. In some implementations, the text analyzer 250 can be implemented as a rules-based engine that can use a set of developer/user-specified rules to identify the candidate text items (which include, e.g., one or more characters/words, or other text items (e.g., numbers)) included within the contacted area/identified portion of the page/document.

[0061] In some implementations, a set of rules for identifying multiple text items could be specified by, e.g., a developer, for each type of gesture that may be used to identify the text that the user intended to select. For example, one set of rules can be used to identify multiple text items corresponding to a gesture indicating underlining (e.g., sliding the finger across the device display in a straight line). In this example, the text analyzer 250 can use the set of rules to identify/generate multiple text items (e.g., different sequences of words) using the text in the contacted area (spanning the line or lines of text generally corresponding to the line formed by the user sliding the finger across the display) and the surrounding areas (e.g., a set number of lines (such as 2-3 lines) above and below the area contacted by the user’s finger). As another example, another set of rules can be used to detect multiple text items corresponding to a gesture indicating circling (e.g., sliding the finger across the device display in a circular motion). In this example, the text analyzer 250 can use the rules to identify/generate multiple text items (e.g., sequences of words, individual words, etc.) using the text in the contacted area (spanning the line generally encompassed by the circling gesture) and the surrounding areas (e.g., a set number of lines (such as 2-3 lines) above and below the area contacted by the user’s finger). As another example, another set of rules can be used to detect multiple text items corresponding to a gesture indicating a one-point tap/touch (e.g., a finger press on one area followed by the finger being released from that area). In this example, the text analyzer 250 can use the rules to identify/generate multiple text items, e.g., one or more words within or around the contacted area, one or more sentences within or around the contacted area, the paragraph(s) including the contacted area, etc.
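As a concrete illustration of such gesture-specific rule sets, the Python sketch below generates candidate text items from the contacted lines plus a small window of surrounding lines, dispatching on the detected gesture type. The specific rules, n-gram lengths, and the two-line context window are illustrative assumptions rather than the actual rule set described above.

import re

def text_items_for_gesture(gesture_type, lines, contact_line_range, context_lines=2):
    """Generate candidate text items from the contacted lines plus a small
    window of surrounding lines, using a rule set keyed by gesture type."""
    start, end = contact_line_range
    lo = max(0, start - context_lines)
    hi = min(len(lines), end + 1 + context_lines)
    region = " ".join(lines[lo:hi])
    words = re.findall(r"\S+", region)

    items = []
    if gesture_type == "underline":
        # Underlining suggests a contiguous phrase: emit word n-grams.
        for n in range(1, 6):
            items += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    elif gesture_type == "circle":
        # Circling suggests a word or short phrase inside the circled area.
        for n in range(1, 4):
            items += [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]
    elif gesture_type == "tap":
        # A one-point tap could mean a word, a sentence, or the whole region.
        items += words
        items += re.split(r"(?<=[.!?])\s+", region)
        items.append(region)
    return items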

[0062] In some implementations, the text analyzer 250 can include an additional set of rules to identify, from among the multiple text items (as identified above, e.g., based on the user’s gesture(s)), a set of candidate text items. One such set of rules can determine whether each of the multiple text items is wholly or partially encompassed within the contacted area, and assign a score (e.g., a score ranging from 0 to 1) to each text item based on how much of the candidate text item is encompassed within the contacted area/identified portion of the page/document. In this example, a text item that is fully or substantially encompassed within the contacted area will be assigned a higher score compared to a text item where only a portion of that text item (e.g., a single letter of a word) is included within the contacted area. In this example, the score can be determined by determining a proportional value representing the proportion of the identified text item that is included in the contacted area, where the proportional value can range from 0 to 1, with 0 indicating that the text item is not included in the contacted area and 1 indicating that the text item is fully encompassed by the contacted area. In some implementations, the text analyzer 250 can select the set of candidate text items (which is also referred to herein as the first set of candidate text items) based on the scores of the multiple text items (e.g., by selecting the text items for which the scores satisfy (e.g., meet or exceed) a predetermined threshold or by selecting the N candidate text items with the top N scores). Alternatively, each of the multiple text items can make up the set of candidate text items.
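A minimal sketch of this overlap-proportion scoring, assuming each text item carries a bounding box in screen coordinates; the box representation and the 0.5 threshold are illustrative assumptions.

def overlap_proportion(item_box, contact_box):
    """Fraction of a text item's bounding box that falls inside the contacted
    area (0 = no overlap, 1 = fully encompassed). Boxes are (x0, y0, x1, y1)."""
    ix0 = max(item_box[0], contact_box[0])
    iy0 = max(item_box[1], contact_box[1])
    ix1 = min(item_box[2], contact_box[2])
    iy1 = min(item_box[3], contact_box[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area = (item_box[2] - item_box[0]) * (item_box[3] - item_box[1])
    return inter / area if area else 0.0

def first_candidate_set(text_items, contact_box, threshold=0.5, top_n=None):
    """Score each (text, box) item by its overlap proportion and keep either
    the items whose scores satisfy the threshold or the top-N scorers."""
    scored = [(text, overlap_proportion(box, contact_box))
              for text, box in text_items]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if top_n is not None:
        return scored[:top_n]
    return [(text, score) for text, score in scored if score >= threshold]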

[0063] Alternatively, or additionally, the text analyzer 250 can implement a machine learning model that can use supervised or unsupervised techniques to determine scores for the multiple text items included in the identified portion/area of the page/document. One such machine learning model implementation can utilize a supervised approach in which the model is trained to process the text included in the identified portion/area of the page/document and, in some cases, the type of gesture that the user applied (e.g., underlining, circling, one-tap touch) to identify multiple text items and output a score for each such text item, where the score indicates a likelihood that the user intended to select a particular text item. In some implementations, the model can be trained using a set of training data that includes identified portions/areas of multiple pages/documents and the associated gestures, and labels identifying the candidate text items included in each respective identified portion/area. Moreover, and as described in the example rules-based system implementation above, the model can be trained to output higher scores for text items that are wholly or substantially included within the contacted area. In some implementations, the text analyzer 250 can then select the first set of candidate text items based on the scores of the multiple text items (e.g., by selecting the text items for which the scores satisfy (e.g., meet or exceed) a predetermined threshold or by selecting the N candidate text items with the top N scores).
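The sketch below illustrates the supervised alternative with a simple logistic-regression scorer over two hand-crafted features (overlap proportion and a one-hot gesture type). The use of scikit-learn, the feature set, and the toy training data are assumptions made for illustration; the patent does not specify a model family or features.

from sklearn.linear_model import LogisticRegression
import numpy as np

# Illustrative training data: each row is [overlap_proportion, is_underline,
# is_circle, is_tap]; each label says whether the user actually selected the item.
X_train = np.array([
    [0.95, 1, 0, 0],
    [0.10, 1, 0, 0],
    [0.80, 0, 1, 0],
    [0.30, 0, 0, 1],
    [0.05, 0, 0, 1],
    [0.90, 0, 0, 1],
])
y_train = np.array([1, 0, 1, 0, 0, 1])

model = LogisticRegression().fit(X_train, y_train)

def score_text_items(features):
    """Return, for each feature row, the model's probability that the user
    intended to select the corresponding text item."""
    return model.predict_proba(np.asarray(features))[:, 1]

# Example: score two hypothetical items for a one-tap gesture.
print(score_text_items([[0.85, 0, 0, 1], [0.15, 0, 0, 1]]))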

[0064] Linguistic Input Detection and Analysis

[0065] In some implementations, the first set of candidate text items for the selected text (as output by the text analyzer 250) can be refined or narrowed using an input 210 of a linguistic input modality (e.g., an audio input specifying words spoken by a user), as further described below. Although FIGS. 2A-2C depict a linguistic input in the form of verbal/audio input received from the user, the techniques described here are not limited to this particular linguistic input and, in fact, can include any other linguistic input types. Additional examples of linguistic input types include, without limitation, typing input received from keyboards, keypads, or other keyboard alternatives (e.g., keyboard pedals), handwritten input received from touch pads that can detect and parse handwritten input, and inputs received from myoelectric sensors. For brevity and ease of description, the following description of FIG. 2C utilizes verbal/audio input in the form of words spoken by a user.

[0066] In some implementations, the linguistic input is converted into a textual transcription (i.e., a transcription of the spoken utterance into a textual sequence of words). In the implementation shown in FIG. 2C (and as explained further with reference to FIGS. 2A and 2B), verbal input is detected (e.g., by an audio driver of the user device 102) and is provided to the speech recognizer 230. The speech recognizer 230 implements a speech recognition algorithm to detect the speech in the verbal/audio input and convert the speech into the textual transcription. In such implementations, various different speech recognition algorithms can be implemented by the speech recognizer 230, including, among others, Hidden Markov models (HMM) and neural network-based speech recognition pipelines, etc.
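For illustration, the sketch below uses the open-source SpeechRecognition package as one possible recognizer backend; the patent does not name any particular library or recognition engine, so the package, the engine choice, and the file name are assumptions.

import speech_recognition as sr  # assumption: any recognizer backend could play this role

def transcribe(audio_path):
    """Convert a recorded utterance into a textual transcription, the role
    played by the speech recognizer 230 in FIG. 2C."""
    recognizer = sr.Recognizer()
    with sr.AudioFile(audio_path) as source:
        audio = recognizer.record(source)
    # recognize_google is one of several engines the package exposes.
    return recognizer.recognize_google(audio)

# transcript = transcribe("utterance.wav")
# e.g. "remind me to meet Joe for lunch at this time"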

[0067] One skilled in the art will appreciate that in implementations where non-spoken input types are used, the corresponding input would be processed by an appropriate input detection apparatus to determine the textual sequence of words. For example, when a user provides handwritten input (e.g., as drawn using a pen and a touch-sensitive drawing pad area), a handwriting processing apparatus can be provided (in place of or in addition to the speech recognizer 230) to process the handwritten text and convert it into a digital, textual transcription of the handwritten text. For brevity and ease of description, the following description of FIG. 2C is based on spoken/audio input for which a textual transcription is generated by a speech recognizer 230.

[0068] The speech recognizer 230 can provide the textual transcription to the utterance interpreter 280 as well as the command identifier 260, and the operations of each of these components are described below.

[0069] The utterance interpreter 280 takes, as input, the textual transcription output by the speech recognizer 230. The utterance interpreter 280 can be configured to process the textual transcription to select a text analysis algorithm from among multiple text analysis algorithms (as further described below). Examples of text analysis algorithms include, among others, date identifying algorithms, time identifying algorithms, and address or contact information identifying algorithms. One skilled in the art will appreciate that any number of text analysis algorithms can be implemented to identify and select particular types of text that have a particular structure or hierarchy (e.g., currency amounts generally include a string of numbers that is immediately preceded by a currency sign, such as $; numbers with units generally include a string of numbers that is accompanied by the text (full form or abbreviation) of the units).

[0070] The utterance interpreter 280 can apply a set of rules (e.g., rules written by a developer or learned by a neural network or machine learning system) to select the particular text analysis algorithm using the one or more words included in the textual transcription output by the speech recognizer 230.

[0071] In some implementations, the utterance interpreter 280 can be implemented as part of a rules-based engine that uses one or more user/developer-specified rules to select a particular text analysis algorithm from among multiple text analysis algorithms using the text included in the textual transcription. For example, one or more such rules can specify that a time analysis algorithm should be selected, e.g., when the textual transcription (output by the speech recognizer 230) includes certain time-related words (e.g., “time,” “when,” “start,” “end,” “schedule”). Accordingly, in an example scenario where a user says “schedule a meeting for this time,” the utterance interpreter 280 evaluates the above-specified example rule and, based on the usage of the words “time” and “schedule” in the textual transcription, determines that a time identifying algorithm should be selected.
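A minimal sketch of such keyword-driven selection of a text analysis algorithm; the keyword lists and algorithm names are illustrative assumptions rather than the actual rule set.

# Illustrative keyword rules mapping words in the transcription to a
# text analysis algorithm.
ALGORITHM_KEYWORDS = {
    "time_identifier": {"time", "when", "start", "end", "schedule"},
    "date_identifier": {"date", "day", "tomorrow", "deadline"},
    "contact_identifier": {"call", "email", "address", "contact", "phone"},
}

def select_text_analysis_algorithm(transcription):
    """Pick the algorithm whose keyword set best matches the transcription."""
    words = set(transcription.lower().split())
    best, best_hits = None, 0
    for algorithm, keywords in ALGORITHM_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best, best_hits = algorithm, hits
    return best

print(select_text_analysis_algorithm("schedule a meeting for this time"))
# -> "time_identifier"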

[0072] In some implementations, in addition to the textual transcription or instead of the textual transcription, the utterance interpreter 280 can use as input the first set of candidate text items (as determined by the text analyzer 250 based on the touch input — described above) and additionally evaluate these candidate text items against the types of rules described above to determine/select a particular text analysis algorithm.

[0073] In some implementations, the utterance interpreter 280 can be a rules-based parser, in which multiple rules can be specified by a user/developer, a neural network parser, or a hybrid parser that utilizes user-specified rules and rules that are learned when a neural network (or another machine learning model) is implemented. When implemented as a rules-based parser, the utterance interpreter 280 can use a dynamic programming algorithm or a backtracking algorithm. In implementations where machine learning techniques are used, the utterance interpreter 280 can be implemented using a machine learning model that is trained to accept features/parameters related to the textual transcription output by the speech recognizer 230 and generate, based on these inputs, a score for each text analysis algorithm that indicates a likelihood that the particular text analysis algorithm can identify the type of text that the user intended to select. The utterance interpreter 280 can then be configured to select the text analysis algorithm with the highest score.

[0074] In some implementations, the utterance interpreter 280 generates a script, program, or executable data structure that specifies the selected text analysis algorithm, and this script, program, or executable data structure is provided to the text analyzer 250.

[0075] The text analyzer 250 uses the text analysis algorithm (as included, e.g., in the script, program, or executable data structure provided by the utterance interpreter 280) to identify a second set of candidate text items from the text included in the displayed document (as further described below).

[0076] In some implementations, the text analysis algorithm is used by the text analyzer 250 to search the text of the document (which can be processed and included, e.g., in a data store/structure) and identify the second set of candidate text items from the document text. In some implementations, the entire document text displayed on the user device 102 or a portion of the document displayed on the user device 102 can be searched. A portion of the document can include text within and in the vicinity of the detected touch, which, e.g., can be specified by identifying lines of text that are encompassed by the detected contacted area as well as a certain number of lines above and/or below the contacted area. In some implementations, the text analyzer 250 searches the text of the document by querying a data store/structure (which can be referred to as a “screen contents data store”) that includes the text of the document included on the screen.

[0077] One example of the above-described functionality of the text analyzer 250 is described now with reference to a time identifying algorithm. In this example, the text analyzer 250 uses the time analysis algorithm to search the text within the document as a whole (or, in some implementations, within and in the vicinity of the contacted area) to identify, as candidate text items, all time-related text items. This can include searching for and identifying, e.g., numbers in a particular format (HH:MM) and words indicating certain times (such as noon or two o'clock) that are included in the document text. Similarly, in examples where another text analysis algorithm is selected (e.g., an address or contact identifying algorithm), the selected algorithm is used by the text analyzer 250 to search and parse the text in the document as a whole or a portion thereof to identify candidate text items (e.g., text items corresponding to an address or a contact). In this manner, the text analyzer 250 generates the second set of candidate text items using the text analysis algorithm selected by the utterance interpreter 280 and provided to the text analyzer 250.

[0078] In some implementations, this second set of candidate text items can be correlated with the first set of candidate text items (which, as described above, is based on the gesture/spatial input). For example, this correlation can be performed by applying a text correlation or matching algorithm that compares the two sets of candidate text items and ranks or scores (to generate a correlation score) each candidate text item in the second set of candidate text items based on whether that candidate text item matches (wholly or partially) one or more candidate text items in the first set of candidate text items. In this example, a higher score indicates that a particular candidate text item in the second set of candidate text items has a higher degree of match with a candidate text item in the first set of candidate text items (compared to the correlation for another candidate text item). Based on the correlation and the determined scores, the candidate text item with the highest score can be selected as representing the most likely candidate text item that the user intended to select.
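
For illustration only, the following non-limiting sketch combines the two steps just described: a simple time-identifying pass over the document text (producing the second set of candidate text items) followed by a correlation against the gesture-derived first set. The regular expression and the overlap-based scoring are illustrative assumptions.

```python
import re
from typing import List, Optional

# Illustrative sketch: identify time-like strings, then correlate them with
# the gesture-derived candidates by simple substring overlap.
TIME_PATTERN = re.compile(
    r"\b\d{1,2}:\d{2}\s*(?:AM|PM)?\b|\bnoon\b|\bmidnight\b", re.IGNORECASE)

def find_time_candidates(text: str) -> List[str]:
    """Second set of candidates: all time-like strings in the document text."""
    return [m.group(0) for m in TIME_PATTERN.finditer(text)]

def correlate(first_set: List[str], second_set: List[str]) -> Optional[str]:
    """Pick the speech-derived candidate that best overlaps the gesture-derived set."""
    def score(candidate: str) -> int:
        return sum(1 for item in first_set
                   if candidate.lower() in item.lower()
                   or item.lower() in candidate.lower())
    return max(second_set, key=score, default=None)

doc_text = "Lunch with Joe at 12:00 PM; project review at 3:30 PM."
first = ["at 12:00 PM", "Lunch with Joe"]   # e.g., from the contacted area
second = find_time_candidates(doc_text)     # ['12:00 PM', '3:30 PM']
print(correlate(first, second))             # 12:00 PM
```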

[0079] Alternatively, based on the correlation and the determined scores, the N candidate text items with the top-N scores can be identified, indicating that one of these N candidate text items is more likely to be the text item that the user intended to select (compared to the other candidate text items). In such cases, the text analyzer 250 can include scoring rules that determine scores (e.g., a score ranging from 0 to 1) for the various candidate text items among the N candidate text items. As above, the score represents a likelihood that a given candidate text item is the text that the user intended to select. One example scoring rule assigns a score based on the proximity of the location of a particular candidate text item to the location on the page/document identified by the gesture/contacted area. Such a scoring rule would assign a higher score to a particular candidate text item that is closer (in proximity) to the location of the contacted area, compared to another candidate text item that is located farther from the location of the contacted area. Based on such scoring rules, a particular candidate text item can be selected from the N candidate text items as representing the most likely candidate text item that the user intended to select.
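
For illustration only, the following non-limiting sketch shows one possible form of the proximity scoring rule, in which on-screen distance from the contacted area is mapped to a score between 0 and 1. The coordinates and the decay constant are illustrative assumptions.

```python
import math
from typing import Dict, Tuple

# Illustrative sketch: score candidates by proximity to the touch location.
def proximity_score(candidate_xy: Tuple[float, float],
                    touch_xy: Tuple[float, float],
                    decay: float = 200.0) -> float:
    """Map on-screen distance (in pixels) to a score in (0, 1]."""
    distance = math.dist(candidate_xy, touch_xy)
    return math.exp(-distance / decay)  # closer -> score nearer 1

candidates: Dict[str, Tuple[float, float]] = {
    "12:00 PM": (120.0, 410.0),
    "3:30 PM": (120.0, 980.0),
}
touch = (130.0, 400.0)
best = max(candidates, key=lambda c: proximity_score(candidates[c], touch))
print(best)  # 12:00 PM, the candidate nearest the contacted area
```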

[0080] In some implementations, a machine learning model can be utilized to determine a particular candidate text item that represents the most likely candidate text item that the user intended to select. In such implementations, the model can be trained to accept, as inputs, the textual transcription output by the speech recognizer 230, the first and second set of candidate text items output by the text analyzer 250, and the text data included on the displayed page, and output scores for each candidate text item indicating whether the particular candidate text selection was the text that the user intended to select. Such a machine learning model can be trained, e.g., in a supervised fashion using training data, which includes features of the first and second sets of candidate text items, textual transcription output by the speech recognizer 230, and text data included on the displayed pages, and a corresponding set of ground truth labels representing text items that a user actually intended to select. Based on the scores output by such a trained model, the text analyzer 250 can select a particular candidate text item with a highest score, which represents the most likely candidate text item that the user intended to select.

[0081] Although the above operations describe generating the candidate text items based on the gesture-based input before generating the candidate text items based on the linguistic input, in some implementations, the candidate text items based on the linguistic input may be generated prior to those generated based on the spatial input. Alternatively, both sets of candidate text items may be generated in parallel. In each of these alternative implementations, the candidate text items generated based on each type of input can be correlated (as described above) to identify the most likely text item that the user intended to select.

[0082] Moreover, in some alternative implementations, the candidate text items generated based on the speech input can be sent to the gesture analyzer 240, which in turn can compare the location of each candidate text item against the location and form of the gesture input received from the user. Based on this comparison, the gesture analyzer 240 can identify a particular text item, from among the set of candidate text items generated based on the speech input, that represents the text item that the user intended to select.

[0083] In some implementations, the text analyzer 250 sends the particular candidate text item (as determined by the text analyzer 250) to the command identifier 260, which in turn uses this text item in the identification and generation of a command, as further described in the following section.

[0084] Command Identification and Execution

[0085] The textual transcription output by the speech recognizer 230 is used by the command identifier 260 to identify a command or operation that the user is requesting to be performed and to identify the parameters of the command.

[0086] The command identifier 260 can perform command identification in any number of ways. In some implementations, the command identifier 260 can be implemented as a rules-based engine that uses a set of user/developer-specified rules to detect the presence of certain words in the verbal command that are associated with particular computing operations. For example, the command identifier 260 can include a rule specifying that the word "remind" (or a variation thereof, e.g., "reminder") at or near the beginning of a spoken input (e.g., in the first five words of the spoken input) indicates that the user wants to generate a reminder, and that a subset of the subsequent words of the verbal input indicates the subject of the reminder. In this example, when the verbal command is "remind me to meet Joe for lunch at this time," the command identifier 260 applies the above example rule to determine that the intended command or operation is to generate a reminder and that the subject of the reminder is "meet Joe for lunch at this time."
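
For illustration only, the following non-limiting sketch shows how such a rule might be evaluated: if a variation of "remind" appears among the first few words of the transcription, the request is treated as a reminder and the remaining words become its subject. The five-word window and the stripping of "me to" are illustrative assumptions.

```python
from typing import Dict, Optional

# Illustrative sketch: detect a "remind" command near the start of the
# utterance and take the subsequent words as the reminder subject.
def identify_command(transcription: str, window: int = 5) -> Optional[Dict[str, str]]:
    words = transcription.split()
    for i, word in enumerate(words[:window]):
        if word.lower().startswith("remind"):
            subject = " ".join(words[i + 1:])
            if subject.lower().startswith("me to "):  # drop "me to" filler words
                subject = subject[len("me to "):]
            return {"command": "create_reminder", "subject": subject}
    return None

print(identify_command("remind me to meet Joe for lunch at this time"))
# {'command': 'create_reminder', 'subject': 'meet Joe for lunch at this time'}
```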

[0087] The command identifier 260 further uses the particular candidate text item identified by the text analyzer 250 to replace any phrase in the identified command that relates to the most likely text item indicated by the user’s gesture. In the above FIG. 2B example, the phrase “this time” corresponds to the user gesture and would be replaced by the most likely candidate text item of 12:00 PM. Accordingly, the command identifier 260 determines that the intended command or operation is to generate a reminder and the subject of the reminder is “meet Joe for lunch at 12:00 PM.”
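
For illustration only, the following non-limiting sketch replaces the gesture-referring phrase in the command subject with the resolved candidate text item. The list of phrases treated as gesture references is an illustrative assumption.

```python
from typing import Tuple

# Illustrative sketch: substitute the phrase that refers to the gesture
# ("this time") with the most likely candidate text item ("12:00 PM").
DEICTIC_PHRASES: Tuple[str, ...] = ("this time", "this", "that", "here")

def resolve_reference(subject: str, resolved_item: str) -> str:
    """Replace the first gesture-referring phrase found with the resolved text item."""
    lowered = subject.lower()
    for phrase in DEICTIC_PHRASES:
        if phrase in lowered:
            start = lowered.index(phrase)
            return subject[:start] + resolved_item + subject[start + len(phrase):]
    return subject

print(resolve_reference("meet Joe for lunch at this time", "12:00 PM"))
# meet Joe for lunch at 12:00 PM
```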

[0088] The command identifier 260 then provides the identified command/operation along with any data associated with the command (e.g., the subject of the reminder in the above example), to the command executor 270.

[0089] The command executor 270 in turn uses the command and the associated data to perform one or more computing operations. In some implementations, the command executor 270 identifies, using the command, the application on the computing device that is needed to execute the command. In such implementations, the digital assistant 120 can store, e.g., in a data structure, a mapping between different actions/commands and the application responsible for performing those actions/commands. For example, the data structure can include a mapping between a command that includes the word “remind” (or variations thereof) and the Reminder Application. As another example, the data structure can include a mapping between a command that includes the word “appointment” or “meeting” and the Calendar application. In the above example where the command is identified as generating a reminder to meet Joe for lunch, the command executor 270 uses the data structure’s mapping to determine that “reminder” in the identified command corresponds to the Reminder application, and thus the command should be performed using the Reminder application.
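
For illustration only, the following non-limiting sketch shows one possible form of the mapping between command keywords and the application responsible for executing them. The keywords and application names are illustrative assumptions.

```python
from typing import Optional

# Illustrative sketch: a keyword-to-application mapping used to route an
# identified command to the application that should execute it.
COMMAND_APP_MAP = {
    "remind": "Reminder",
    "appointment": "Calendar",
    "meeting": "Calendar",
}

def application_for(command_text: str) -> Optional[str]:
    """Return the application mapped to the first matching keyword, if any."""
    lowered = command_text.lower()
    for keyword, app in COMMAND_APP_MAP.items():
        if keyword in lowered:
            return app
    return None

print(application_for("create a reminder to meet Joe for lunch at 12:00 PM"))
# Reminder
```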

[0090] The command executor 270 interfaces with the identified application to cause the command to be executed. In the above example, the command executor 270 interfaces with the Reminder application on the user device to create a new reminder and provides the associated data (the text "meet Joe for lunch at 12:00 PM") to the Reminder application, which uses this data to create the text of the reminder and the time for the reminder. To accomplish this, in some implementations, the command executor 270 can generate a message in a format that is understandable by the identified application (where the message includes the command's parameters and the associated data) and sends this message to the identified application. In such implementations, each application can previously expose or make available to the digital assistant 120 (e.g., in a shared data structure stored in the computing device's memory or via an application programming interface) the types of actions that can be performed by the application and the specified format for each such action. Thus, when the command executor 270 determines that a particular application is needed to perform a command, the command executor 270 utilizes this previously-exposed information to identify the appropriate format within which to generate the message and then generates the message using the data included in the identified command (as generated by the command identifier 260) according to the identified format.

[0091] In some scenarios where an application has not previously made available to the digital assistant 120 the types of actions that can be performed by the application and their respective formats, the command executor 270 can generate a message in a generic format that, e.g., can specify, using tags or other identifiers, the action to be performed by the application, the data for the specific action, and any additional parameters and corresponding data for those other parameters required to perform the action. In the above example where the identified command is to remind the user to meet Joe for lunch at 12:00 pm, the generated message can be in the following format (e.g., using any tag-based programming language): [Action: Create Reminder; Data: "Meet Joe for Lunch"; Parameters: {Time: 12:00 pm}]. Upon receiving such a message, the identified application can parse the message to identify the various tags/identifiers included in the message and the corresponding data and parameters, and perform the specified action using the parsed data and/or parameters.

[0092] In some implementations, the identified application can send a response message to the command executor 270, specifying that the action specified in the message sent by the command executor 270 was, e.g., completed or was not completed. In instances where an action was not completed, the identified application can also specify whether additional information is needed for the action specified in the command executor 270's message to be completed (e.g., the date for the reminder, or when the user wants to be reminded) or provide a reason why the action cannot be completed.
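
For illustration only, the following non-limiting sketch builds and parses a message in the generic, tag-based format shown above. The exact delimiters and the flat handling of parameters are illustrative assumptions.

```python
from typing import Dict

# Illustrative sketch: construct and parse a generic tag-based message of the
# form [Action: ...; Data: "..."; Parameters: {...}].
def build_message(action: str, data: str, parameters: Dict[str, str]) -> str:
    params = ", ".join(f"{k}: {v}" for k, v in parameters.items())
    return f'[Action: {action}; Data: "{data}"; Parameters: {{{params}}}]'

def parse_message(message: str) -> Dict[str, str]:
    fields: Dict[str, str] = {}
    for part in message.strip("[]").split(";"):
        key, _, value = part.partition(":")
        # Parameters are kept as a flat string in this simplified sketch.
        fields[key.strip()] = value.strip().strip('"{}')
    return fields

msg = build_message("Create Reminder", "Meet Joe for Lunch", {"Time": "12:00 pm"})
print(msg)
# [Action: Create Reminder; Data: "Meet Joe for Lunch"; Parameters: {Time: 12:00 pm}]
print(parse_message(msg))
```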

[0093] In some implementations, the command executor 270 can implement a dialogue manager and response system that can generate and output responses to a user, request additional input from a user, and provide a response based on any additionally received user input. For example, the command executor 270 can provide a message to the user based on the response received from the identified application. Such a message to the user can indicate that the action was completed (e.g., “Reminder set”) or that the action was not completed (e.g., “Reminder was not set”). And, in instances where the application requests additional information (as described above), the command executor 270 can send a message to the user, to solicit such additional information. In the above example relating to setting a reminder, the command executor 270 can request the additional information from the user (e.g., “What date should the reminder be set for?”, “When do you want to be reminded about this?”). The user can type a responsive message within an interface or speak the response aloud, which is then transcribed (e.g., by the speech recognizer 230), and the text of the user’s responsive message is provided by the command executor 270 to the identified application. Again, the identified application can respond to the command executor 270, confirming or denying the requested action, and the command executor 270 can then communicate this response to the user and in some instances, solicit additional information from the user.

[0094] In the above description of the operations of the command executor 270 and the associated examples, the application with which the command executor 270 interacts is different from the application within which the document and the associated text is being displayed on the computing device. However, one skilled in the art will appreciate that the application with which the command executor 270 interacts can be the same one on which the document is being displayed (and with which the user interacted via the touch or another gesture-based input). This is further described and illustrated with reference to FIG. 2D below.

[0096] Although the above-described techniques are described using user inputs from two particular types of input modalities (i.e., touch-based gesturing and speech input), any combination of input modality types can be used to provide the multiple modes of input. For example, instead of or in addition to finger touch-based gesturing, one or more other spatial input types (e.g., stylus-based contact gesturing, contactless gesturing, indirect gesturing through a mouse or other such device) can be implemented. Similarly, instead of or in addition to the audio/verbal input, one or more other linguistic input types (e.g., keyboard or keypad-based typing, inputs received via myoelectric sensors, handwritten inputs) can be implemented. Moreover, in some implementations, the user may convey linguistic information using a spatial input, which can be specified, e.g., using a shape and location of a gesture.

[0097] Moreover, one skilled in the art will understand that the speech recognizer 230 may employ algorithms that generate more than one interpretation/hypothesis (and by extension, more than one textual transcription of the spoken utterance). The above-described techniques could thus identify multiple different commands and text items that the user may have intended to select. In such instances, the command executor or another component of the digital assistant can communicate with the user to seek confirmation or clarification regarding which command and/or text item to select.

[0098] Further still, while the components (230, 240, 250, 260, 270, and 280) are shown as being implemented as part of the digital assistant 120 in FIG. 2C, one or more of these components can be implemented outside the digital assistant 120 or even outside the computing device 102 (e.g., as separate and standalone software components that are executed by the device processor or another data processing apparatus). For example, the utterance interpreter 280, text analyzer 250, and command identifier 260 can be implemented as part of a digital assistant that is housed in a cloud or a remote device/server (and thus, not resident on the computing device 102). In this example, these cloud-based components can operate on data output by the speech recognizer 230 and gesture analyzer 240 (e.g., data that is transmitted from the computing device to the cloud server where these components are located), and output a command that is provided to the computing device 102 for execution by the command executor 270 residing on the computing device 102.

[0099] The above-described techniques can be used for more than just selecting text from unstructured documents. This is illustrated further using the example shown in FIG. 2D, which depicts an additional example scenario in which an example digital assistant (e.g., digital assistant 120) uses inputs received using multiple input modalities to select text from a digital document and performs an operation that modifies the digital document (e.g., edits the text of the document) based on the user’s command and the selected text. From the below disclosure, one skilled in the art will appreciate that the digital assistant 120 (and in particular, command executor 270) interfaces with the same application within which the document and the associated text (with which the user interacts, e.g., using the touch or gesture-based input) is being displayed on the computing device.

[0100] In this example, and as shown at step 290, the user can point to an item in a list found in an unstructured section of text and say: "move this item to the start" while drawing a line from one item (which may consist of one or more words) to a location at the start of the list.

[0101] To correctly respond to this command, the text analyzer 250 can be configured to search for lists within the screen text. The text analyzer 250 identifies the particular list that the user intended (e.g., identifying the list that is closest to the touched area of the screen) and identifies each element of this list, as shown at step 292.

[0102] The identified elements are structured into a structured list, as shown at step 294, and the text analyzer 250 then modifies the structured list by moving the identified element (i.e., the element that the user intended to have moved) to the location in the list to which it is to be moved, as indicated by the user’s gesture (as shown at step 296).

[0103] The modified list can then be written back to unstructured text and, when so written, the text can be written to retain correct punctuation and grammatical formatting (e.g., changing the new-first item in the list to begin with an upper case letter, changing the new-last item in the list to begin with a lower case letter, maintaining the "and" between the last item and the second-to-last item in the list), as shown at step 298.
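
For illustration only, the following non-limiting sketch walks through steps 292-298 for a simple comma-separated list: the unstructured sentence is split into elements, the indicated element is moved to the front, and the list is written back with capitalization and the "and" conjunction restored. The "A, B and C." parsing pattern is an illustrative assumption.

```python
from typing import List

# Illustrative sketch: restructure a sentence-style list, move one element to
# the start, and write the list back with punctuation/capitalization fixed.
def move_item_to_start(sentence: str, item: str) -> str:
    body = sentence.rstrip(".")
    parts: List[str] = [p.strip().lower()
                        for p in body.replace(" and ", ", ").split(",") if p.strip()]
    parts.remove(item.lower())      # step 296: pull out the indicated element...
    parts.insert(0, item.lower())   # ...and place it at the start of the list
    parts[0] = parts[0].capitalize()
    return ", ".join(parts[:-1]) + " and " + parts[-1] + "."  # step 298

print(move_item_to_start("Apples, pears and oranges.", "oranges"))
# Oranges, apples and pears.
```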

[0104] As another example, the user could point to a month in a date "January 30, 2022" and say "move this to next month," and the text analyzer 250 would be configured to identify the date object and rewrite it as "February 28, 2022." In some instances, based on the user command, the digital assistant 120 may not be able to resolve what precise action the user intended (in the above example, the digital assistant 120 might not be able to resolve whether the user intended to move the event up by 30 days or move it to a date 4 weeks later). In such instances, the digital assistant 120 can be configured to request confirmation or clarification of commands, or inform the user what action was taken in response to a command and seek confirmation that the action performed was the one that the user intended. In the above example, the digital assistant could perform the date change and generate the following prompt that is output to the user: "I have moved the date up 29 days, to the 28th of February."
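
For illustration only, the following non-limiting sketch shows the date rewrite from the example above, advancing a date to the next calendar month and clamping to that month's last day so that January 30, 2022 becomes February 28, 2022. The clamping policy is one assumed resolution of the ambiguity discussed above.

```python
import calendar
from datetime import date

# Illustrative sketch: move a date to the next month, clamping the day to the
# last valid day of the target month.
def move_to_next_month(d: date) -> date:
    year = d.year + (1 if d.month == 12 else 0)
    month = 1 if d.month == 12 else d.month + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))

print(move_to_next_month(date(2022, 1, 30)).strftime("%B %d, %Y"))
# February 28, 2022
```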

[0105] FIG. 3 is a flow diagram of an example process 300 for using user inputs from multiple input modalities to select text from documents and perform operations based on the selected text. Operations of the process 300 are illustratively described below as being implemented, for example, by the digital assistant 120 and its associated components, as shown in FIG. 2C. Operations of the process 300 can also be implemented as instructions stored on one or more computer readable media, which may be non-transitory, and execution of the instructions by one or more data processing apparatus can cause the one or more data processing apparatus to perform the operations of the process 300.

[0106] The digital assistant 120 detects, during display of a document on the computing device, a first user input that is made using a first input modality and a second user input that is made using a second input modality that is different from the first input modality (at operation 310). As described above, an input modality is a modality or mode by which a user interacts with data provided for display on a computing device, such as a document (a document can include, without limitation, a web page, text and other objects presented on a user interface, an image, pages of data and text displayed in applications, or any other text, spreadsheet, or other images or objects).

[0107] As described above, generally, there are two input modalities: (1) a spatial input modality, in which a user communicates or interacts with a computing device using spatial movements, and (2) a linguistic input modality, in which a user interacts with a computing device using a sequence of words. As described with reference to FIGS. 2A-2C, a spatial input mode/modality can include various types of input, including without limitation touch input, stylus-based input, or contactless gesture-based input, wherein the contactless gesture-based input can include, e.g., input based on eye-movement tracking, foot position tracking, or hand position tracking. And, as described with reference to FIGS. 2A-2C, a linguistic input mode/modality can include various types of input, including without limitation, verbal/spoken input, other types of audio input, keyboard input or similar input provided by typing or handwriting words, or input received via myoelectric sensors.

[0108] In some implementations, the digital assistant 120 can include a gesture analyzer 240 that detects the spatial input, as described with reference to FIG. 2C. In such implementations, the gesture analyzer 240 can detect, e.g., a touch input on the touch screen display of the computing device (e.g., by interacting/interfacing with a device driver of the computing device) and determine that the touch input was within a particular area of a displayed document (e.g., an area that includes text, or an area including one or more objects, such as images), as further described with reference to FIG. 2C. Moreover, in such implementations (and as described with reference to FIG. 2C), the speech recognizer 230 can detect a linguistic input (e.g., a speech input in audio received by the computing device, e.g., by interacting/interfacing with a microphone of the computing device), which is often issued around the same time as the spatial input (e.g., within two to three seconds before or after the spatial input was received).
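
For illustration only, the following non-limiting sketch shows one way to decide whether a spatial input and a linguistic input belong to the same request, by checking that their timestamps fall within a short window. The three-second window is an illustrative assumption consistent with the range mentioned above.

```python
# Illustrative sketch: treat a touch input and a speech input as a single
# multimodal request when they occur within a short time window of each other.
def pair_inputs(touch_time_s: float, speech_time_s: float,
                window_s: float = 3.0) -> bool:
    """Return True if the two inputs are close enough in time to be paired."""
    return abs(touch_time_s - speech_time_s) <= window_s

print(pair_inputs(touch_time_s=10.2, speech_time_s=11.5))  # True, same request
print(pair_inputs(touch_time_s=10.2, speech_time_s=20.0))  # False, unrelated
```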

[0109] The digital assistant 120 generates, based on the first user input, a set of candidate text items (at operation 320). In some implementations, and as described with reference to FIG. 2C, the text analyzer 250 can analyze the gestured-to area (e.g., in the case of a touch input, the contacted area of the display corresponding to a portion of the document displayed on the computing device) to identify multiple text items in the contacted area. In some implementations, and as described with reference to FIG. 2C, the text analyzer 250 generates scores for each of the multiple text items identified in the contacted area of the displayed document, and selects the set of candidate text items based on the scores of the multiple text items (where a score for a text item indicates a likelihood that the text item was intended to be selected by the spatial input).

[0110] The digital assistant 120 selects, based at least on the second user input, a particular text item from among the set of candidate text items (at operation 330). In some implementations, the speech recognizer 230 processes the speech input (or another apparatus processes/parses another detected linguistic input) to generate a textual transcription therefrom. The utterance interpreter 280 uses the textual transcription to select a text analysis algorithm and provides the selected text analysis algorithm to the text analyzer 250 (as further described with reference to FIG. 2C).

[0111] In some implementations, the text analyzer 250 uses the selected text analysis algorithm (which can be used to identify particular types of text, e.g., times, dates, addresses, etc.) to search the text in the document (e.g., text in the entire document or a portion thereof) and identify a second set of candidate text items (as further described with reference to FIG. 2C). The text analyzer 250 can then correlate the first and second sets of candidate text items to identify a particular text item that represents the text item that the user most likely intended to select (as further described with reference to FIG. 2C).

[0112] The digital assistant 120 generates an instruction to execute on the computing device (at operation 340). In some implementations, and as described with reference to FIGS. 2A-2D, a command identifier 260 uses the textual transcription (as obtained by parsing the linguistic input, e.g., by the speech recognizer 230) and the particular text item (as determined at operation 330) to generate a command/instruction for execution on the computing device.

[0113] The digital assistant 120 instructs the computing device to execute the generated instruction (at operation 350). In some implementations, and as described with reference to FIGS. 2A-2D, the command identifier 260 instructs the command executor 270 to execute the generated instruction, and the command executor 270 then executes the generated instruction, either directly or indirectly (e.g., by interfacing with one or more applications implicated by the instruction, such as, e.g., a reminder application, a word editing application).

[0114] FIG. 4 illustrates a schematic diagram of a general-purpose network component or computer system 400.

[0115] The general-purpose network component or computer system 400 includes a processor 402 (which may be referred to as a central processor unit or CPU) that is in communication with memory devices including secondary storage 404, and memory, such as ROM 406 and RAM 408, input/output (I/O) devices 410, and a network 412, such as the Internet or any other well-known type of network, that may include network connectivity devices, such as a network interface. Although illustrated as a single processor, the processor 402 is not so limited and may comprise multiple processors.

[0116] The processor 402 may be implemented as one or more CPU chips, cores (e.g., a multicore processor), FPGAs, ASICs, and/or DSPs, and/or may be part of one or more ASICs. The processor 402 may be configured to implement any of the schemes described herein. The processor 402 may be implemented using hardware, software, or both.

[0117] The secondary storage 404 is typically comprised of one or more disk drives or tape drives and is used for non-volatile storage of data and as an over-flow data storage device if the RAM 408 is not large enough to hold all working data. The secondary storage 404 may be used to store programs that are loaded into the RAM 408 when such programs are selected for execution. The ROM 406 is used to store instructions and perhaps data that are read during program execution. The ROM 406 is a non-volatile memory device that typically has a small memory capacity relative to the larger memory capacity of the secondary storage 404. The RAM 408 is used to store volatile data and perhaps to store instructions. Access to both the ROM 406 and the RAM 408 is typically faster than to the secondary storage 404.

[0118] It is understood that by programming and/or loading executable instructions onto the network component or computing system 400, at least one of the processor 402 or the memory (e.g., ROM 406, RAM 408) is changed, transforming the network component or computing system 400 in part into a particular machine or apparatus.

[0119] Similarly, it is understood that by programming and/or loading executable instructions onto the network component or computing system 400, at least one of the processor 402, the ROM 406, and the RAM 408 is changed, transforming the network component or computing system 400 in part into a particular machine or apparatus, e.g., a router.

[0120] It is fundamental to the electrical engineering and software engineering arts that functionality that can be implemented by loading executable software into a computer can be converted to a hardware implementation by well-known design rules. Decisions between implementing a concept in software versus hardware typically hinge on considerations of stability of the design and numbers of units to be produced rather than any issues involved in translating from the software domain to the hardware domain. Generally, a design that is still subject to frequent change may be preferred to be implemented in software, because re-spinning a hardware implementation is more expensive than re-spinning a software design. Generally, a design that is stable that will be produced in large volume may be preferred to be implemented in hardware, for example in an ASIC, because for large production runs the hardware implementation may be less expensive than the software implementation. Often a design may be developed and tested in a software form and later transformed, by well-known design rules, to an equivalent hardware implementation in an application specific integrated circuit that hardwires the instructions of the software. In the same manner as a machine controlled by a new ASIC is a particular machine or apparatus, likewise a computer that has been programmed and/or loaded with executable instructions may be viewed as a particular machine or apparatus.

[0121] The technology described herein can be implemented using hardware, firmware, software, or a combination of these. The software used is stored on one or more of the processor readable storage devices described above to program one or more of the processors to perform the functions described herein. The processor readable storage devices can include computer readable media such as volatile and non-volatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer readable storage media and communication media. Computer readable storage media may be implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Examples of computer readable storage media include RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. A computer readable medium or media does (do) not include propagated, modulated or transitory signals.

[0122] Communication media typically embodies computer readable instructions, data structures, program modules or other data in a propagated, modulated or transitory data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as RF and other wireless media. Combinations of any of the above are also included within the scope of computer readable media.

[0123] In alternative embodiments, some or all of the software can be replaced by dedicated hardware logic components. For example, and without limitation, illustrative types of hardware logic components that can be used include Field-programmable Gate Arrays (FPGAs), Application-specific Integrated Circuits (ASICs), Application-specific Standard Products (ASSPs), System-on-a-chip systems (SOCs), Complex Programmable Logic Devices (CPLDs), special purpose computers, etc. In one embodiment, software (stored on a storage device) implementing one or more embodiments is used to program one or more processors. The one or more processors can be in communication with one or more computer readable media/storage devices, peripherals and/or communication interfaces.

[0124] The operations described in this specification can be implemented as operations performed by a data processing apparatus on data stored on one or more computer-readable storage devices or received from other sources.

[0125] The term “data processing apparatus” encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, a system on a chip, or multiple ones, or combinations, of the foregoing. The apparatus can include special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, a cross-platform runtime environment, a virtual machine, or a combination of one or more of them. The apparatus and execution environment can realize various different computing model infrastructures, such as web services, distributed computing and grid computing infrastructures.

[0126] A computer program (also known as a program, software, software application, script, or code) can be written in any form of programming language, including compiled or interpreted languages, declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, object, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data (e.g., one or more scripts stored in a markup language document), in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, subprograms, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

[0127] It is understood that the present subject matter may be embodied in many different forms and should not be construed as being limited to the embodiments set forth herein. Rather, these embodiments are provided so that this subject matter will be thorough and complete and will fully convey the disclosure to those skilled in the art. Indeed, the subject matter is intended to cover alternatives, modifications and equivalents of these embodiments, which are included within the scope and spirit of the subject matter as defined by the appended claims. Furthermore, in the following detailed description of the present subject matter, numerous specific details are set forth in order to provide a thorough understanding of the present subject matter. However, it will be clear to those of ordinary skill in the art that the present subject matter may be practiced without such specific details.

[0128] Aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatuses (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable instruction execution apparatus, create a mechanism for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0129] The description of the present disclosure has been presented for purposes of illustration and description, but is not intended to be exhaustive or limited to the disclosure in the form disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the disclosure. The aspects of the disclosure herein were chosen and described in order to best explain the principles of the disclosure and the practical application, and to enable others of ordinary skill in the art to understand the disclosure with various modifications as are suited to the particular use contemplated. For purposes of this document, each process associated with the disclosed technology may be performed continuously and by one or more computing devices. Each step in a process may be performed by the same or different computing devices as those used in other steps, and each step need not necessarily be performed by a single computing device.

[0130] Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.