


Title:
STOPPING ROBOT MOTION BASED ON SOUND CUES
Document Type and Number:
WIPO Patent Application WO/2020/056373
Kind Code:
A1
Abstract:
Embodiments provide methods and systems to modify motion of a robot based on sound and context. An embodiment detects a sound in an environment and processes the sound. The processing includes comparing the detected sound to a library of sound characteristics associated with sound cues and/or extracting features or characteristics from the detected sound using a model. Motion of a robot is modified based on a context of the robot and at least one of: (i) the comparison, (ii) the features extracted from the detected sound, and (iii) the characteristics extracted from the detected sound.

Inventors:
JOHNSON DAVID (US)
WAGNER SYLER (US)
TAYOUN ANTHONY (US)
LINES STEVEN (US)
Application Number:
PCT/US2019/051175
Publication Date:
March 19, 2020
Filing Date:
September 13, 2019
Assignee:
CHARLES STARK DRAPER LABORATORY INC (US)
International Classes:
B25J9/16
Foreign References:
US20020158599A12002-10-31
US4896357A1990-01-23
DE3723329A11988-01-21
US20030060930A12003-03-27
Attorney, Agent or Firm:
MEAGHER, Timothy J. et al. (US)
Claims:
CLAIMS

What is claimed is:

1. A method for modifying motion of a robot, the method comprising:

detecting a sound in an environment using a sound capturing device;

processing the detected sound, the processing including at least one of:

comparing the detected sound to a library of sound characteristics associated with sound cues; and

extracting features or characteristics from the detected sound using a model; and

modifying motion of a robot based on a context of the robot and at least one of: (i) the comparison, (ii) the features extracted from the detected sound, and (iii) the characteristics extracted from the detected sound.

2. The method of Claim 1, further comprising:

creating the library of sound characteristics associated with the sound cues by:

recording a plurality of sounds in an environment;

identifying one or more of the recorded plurality of sounds as a sound cue;

determining sound characteristics of the one or more plurality of sounds identified as a sound cue;

associating the determined sound characteristics with the one or more plurality of sounds identified as a sound cue in computer memory of the library; and

associating, in the computer memory of the library, a respective action rule with the one or more plurality of sounds identified as a sound cue.

3. The method of Claim 2, wherein identifying one or more of the recorded plurality of sounds as a sound cue is based upon at least one of:

user input flagging a given sound as a sound cue;

context obtained from analyzing non-sound sensor input; and

output of a neural network trained to identify sound cues using the recorded plurality of sounds as input.

4. The method of Claim 1, wherein comparing the detected sound to the library of sound characteristics associated with sound cues includes:

processing the detected sound using a neural network trained to identify one or more characteristics of the detected sound that matches at least one of the sound characteristics associated with the sound cues.

5. The method of Claim 1, wherein comparing the detected sound to the library of sound characteristics associated with sound cues includes:

identifying a sound characteristic of the detected sound matching a given sound characteristic associated with a given sound cue in the library.

6. The method of Claim 5, wherein based on the context and the comparison, modifying motion of the robot includes:

identifying one or more action rules associated with the given sound cue; and

modifying the motion of the robot to be in accordance with the one or more action rules.

7. The method of Claim 6, wherein at least one of the one or more action rules dictates a first result for the motion of the robot and a second result for the motion of the robot, where the motion of the robot is modified to be in accordance with the first result or the second result based upon the context of the robot.

8. The method of Claim 1, wherein the sound cues include at least one of: a keyword, a phrase, a sound indicating a safety-relevant event, and a sound relevant to an action.

9. The method of Claim 1, wherein context of the robot includes at least one of: torque of a joint of the robot; velocity of a link of the robot; acceleration of a link of the robot; jerk of a link of the robot; force of an end effector attached to the robot; torque of an end effector attached to the robot; pressure of an end effector attached to the robot; velocity of an end effector attached to the robot; acceleration of an end effector attached to the robot; task performed by the robot; and characteristics of an environment in which the robot is operating.

10. The method of Claim 1, wherein modifying motion of the robot includes:

comparing the context of the robot to a library of contexts to detect a matching context;

identifying one or more action rules associated with the matching context; and

modifying the motion of the robot to be in accordance with the one or more action rules.

11. The method of Claim 10, further comprising creating the library of contexts by:

recording a plurality of contexts in an environment; and

associating, in computer memory of the library, a respective action rule with one or more of the plurality of recorded contexts.

12. The method of Claim 11, wherein recording the plurality of contexts in the environment uses at least one of: a vision sensor; a depth sensor; a torque sensor; and a position sensor.

13. The method of Claim 11 further comprising identifying the respective action rule associated with the one or more of the plurality of recorded contexts by:

processing the plurality of recorded contexts to identify at least one of: a pattern in the environment in which the contexts were captured and a condition in the environment in which the contexts were captured; and

identifying the respective action rule using at least one of the identified pattern and condition.

14. The method of Claim 13 wherein processing the plurality of recorded contexts to identify at least one of a pattern and a condition includes at least one of:

comparing the plurality of recorded contexts to a library of predefined context conditions; and

evaluating output of a neural network trained to identify patterns or conditions of a context from the plurality of recorded contexts.

15. A system for modifying motion of a robot, the system comprising:

a processor; and

a memory with computer code instructions stored thereon, the processor and the memory, with the computer code instructions, being configured to cause the system to:

detect a sound in an environment using a sound capturing device;

process the detected sound, the processing including at least one of:

comparing the detected sound to a library of sound characteristics associated with sound cues; and

extracting features or characteristics from the detected sound using a model; and

modify motion of a robot based on a context of the robot and at least one of: (i) the comparison, (ii) the features extracted from the detected sound, and (iii) the characteristics extracted from the detected sound.

16. The system of Claim 15 where, in comparing the detected sound to the library of sound characteristics associated with sound cues, the processor and the memory, with the computer code instructions, are further configured to cause the system to:

identify a sound characteristic of the detected sound matching a given sound characteristic associated with a given sound cue in the library.

17. The system of Claim 16 where, in modifying motion of the robot based on the comparison and context, the processor and the memory, with the computer code instructions, are further configured to cause the system to:

identify one or more action rules associated with the given sound cue; and

modify the motion of the robot to be in accordance with the one or more action rules.

18. The system of Claim 15 where, in modifying motion of the robot, the processor and the memory, with the computer code instructions, are further configured to cause the system to:

compare the context of the robot to a library of contexts to detect a matching context;

identify one or more action rules associated with the matching context; and

modify the motion of the robot to be in accordance with the one or more action rules.

19. The system of Claim 15 wherein the processor and the memory, with the computer code instructions, are further configured to cause the system to:

create the library of sound characteristics associated with the sound cues by:

recording a plurality of sounds in an environment;

identifying one or more of the recorded plurality of sounds as a sound cue;

determining sound characteristics of the one or more plurality of sounds identified as a sound cue;

associating the determined sound characteristics with the one or more plurality of sounds identified as a sound cue in computer memory of the library; and

associating, in the computer memory of the library, a respective action rule with the one or more plurality of sounds identified as a sound cue.

20. A non-transitory computer program product for modifying motion of a robot, the computer program product comprising a computer-readable medium with computer code instructions stored thereon, the computer code instructions being configured, when executed by a processor, to cause an apparatus associated with the processor to:

detect a sound in an environment using a sound capturing device;

process the detected sound, the processing including at least one of:

comparing the detected sound to a library of sound characteristics associated with sound cues; and

extracting features or characteristics from the detected sound using a model; and

modify motion of a robot based on a context of the robot and at least one of: (i) the comparison, (ii) the features extracted from the detected sound, and (iii) the characteristics extracted from the detected sound.

Description:
Stopping Robot Motion Based On Sound Cues

RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. Provisional Application No. 62/730,703, filed on September 13, 2018, U.S. Provisional Application No. 62/730,947, filed on September 13, 2018, U.S. Provisional Application No. 62/730,933, filed on September 13, 2018, U.S. Provisional Application No. 62/730,918, filed on September 13, 2018, U.S. Provisional Application No. 62/730,934, filed on September 13, 2018 and U.S. Provisional Application No. 62/731,398, filed on September 14, 2018.

[0002] This application is related to U.S. Patent Application titled “Manipulating Fracturable And Deformable Materials Using Articulated Manipulators”, Attorney Docket No. 5000.1049-001; U.S. Patent Application titled “Food-Safe, Washable, Thermally-Conductive Robot Cover”, Attorney Docket No. 5000.1050-000; U.S. Patent Application titled “Food-Safe, Washable Interface For Exchanging Tools”, Attorney Docket No. 5000.1051-000; U.S. Patent Application titled “An Adaptor for Food-Safe, Bin-Compatible, Washable, Tool-Changer Utensils”, Attorney Docket No. 5000.1052-001; U.S. Patent Application titled “Locating And Attaching Interchangeable Tools In-Situ”, Attorney Docket No. 5000.1053-001; U.S. Patent Application titled “Determining How To Assemble A Meal”, Attorney Docket No. 5000.1054-001; U.S. Patent Application titled “Controlling Robot Torque And Velocity Based On Context”, Attorney Docket No. 5000.1055-001; U.S. Patent Application titled “Robot Interaction With Human Co-Workers”, Attorney Docket No. 5000.1057-001; U.S. Patent Application titled “Voice Modification To Robot Motion Plans”, Attorney Docket No. 5000.1058-000; and U.S. Patent Application titled “One-Click Robot Order”, Attorney Docket No. 5000.1059-000, all of the above U.S. Patent Applications having a first named inventor David M.S. Johnson and all being filed on the same day, September 13, 2019.

[0003] The entire teachings of the above applications are incorporated herein by reference.

BACKGROUND

[0004] Robots operate in environments where they must avoid both fixed and moving obstacles, and often those obstacles are their human co-workers. Collisions with these obstacles, e.g., human co-workers, are unacceptable. Existing methods for robot-obstacle avoidance and for robot control in such environments are cumbersome and inadequate.

SUMMARY

[0005] In environments such as high-traffic restaurant kitchens, humans can typically identify dangerous situations by detecting input from various senses (hearing, touching, smelling, seeing, etc.), analyzing input from external sources (e.g., warnings from other colleagues, alarms), and understanding the contexts of these inputs. Humans can accordingly decide to alter their subsequent actions based on these inputs. In contrast, using existing methods, robots cannot adequately and appropriately modify their operations based upon input from their operating environment.

[0006] Today, robots identify dangers and faults only when measurements cross certain thresholds. In particular, robots today identify dangers by (1) capturing intrusions into predefined zones, (2) measuring quantities such as torque, voltage, or current and comparing the measured quantities to predefined limits, or (3) receiving a mechanical input such as an emergency stop button. There are currently no known methods for robots to detect danger through sound cues and by deducing context from a given set of measured inputs, whether intrinsic (information measured by the robot) or external (alarm system or human sound). Further, current methods primarily rely on information relayed by other sensors such as vision, sonar, or torque sensors, and do not consider generalized inputs such as alarms or the sounds of events, e.g., breaking glass and human screams, amongst other examples.

[0007] Embodiments solve problems in relation to employing robotics in a dynamic workspace, frequently alongside human workers, and enhance a robot’s ability to sense and react to dangerous situations. Unlike existing methods, embodiments provide functionality for robots to infer from context the amount of danger that a situation presents by incorporating one or more data sources, capturing one or more details from these one or more data sources, and using pattern matching and other analysis techniques to recognize danger.

[0008] Embodiments of the present disclosure provide methods and systems for modifying motion of a robot. One such embodiment detects a sound in an environment using a sound capturing device and then processes the detected sound. The processing includes at least one of: (1) comparing the detected sound to a library of sound characteristics associated with sound cues and (2) extracting features or characteristics from the detected sound using a model. In turn, such an embodiment modifies motion of a robot based on a context of the robot and at least one of: (i) the comparison, (ii) the features extracted from the detected sound, and (iii) the characteristics extracted from the detected sound.
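
For illustration only, the detect, process, modify flow described above can be sketched in Python as follows; the dictionary-based cue library, the context dictionary, and the helper callables are assumptions made for this sketch and are not defined in the disclosure.

```python
# Minimal sketch of the detect -> process -> modify pipeline described above.
# The library layout, the context dictionary, and the callables are assumptions.

def process_sound(sound_features, cue_library, feature_model=None):
    """Compare detected sound features to the cue library and/or run a model."""
    matched_cue = None
    for cue_name, entry in cue_library.items():
        if entry["match"](sound_features):          # per-cue matching predicate
            matched_cue = cue_name
            break
    extracted = feature_model(sound_features) if feature_model else None
    return matched_cue, extracted


def modify_motion(matched_cue, extracted, context, cue_library):
    """Pick a motion command from the comparison/extraction result and the context."""
    if matched_cue is not None:
        return cue_library[matched_cue]["action_rule"](context)
    if extracted is not None and extracted.get("bad", False):
        return "stop_motion"
    return "continue"


if __name__ == "__main__":
    library = {
        "yell_stop": {
            "match": lambda s: s.get("keyword") == "stop",
            "action_rule": lambda ctx: "stop_motion" if ctx["moving"] else "continue",
        }
    }
    sound = {"keyword": "stop", "rms": 0.7}
    cue, extracted = process_sound(sound, library)
    print(modify_motion(cue, extracted, {"moving": True}, library))   # stop_motion
```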

[0009] An embodiment creates the library of sound characteristics associated with the sound cues. Such an embodiment creates the library by (1) recording a plurality of sounds in an environment, (2) identifying one or more of the recorded plurality of sounds as a sound cue, (3) determining sound characteristics of the one or more plurality of sounds identified as a sound cue, (4) associating the determined sound characteristics with the one or more plurality of sounds identified as a sound cue in computer memory of the library, and (5) associating, in the computer memory of the library, a respective action rule with the one or more plurality of sounds identified as a sound cue.
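
A minimal sketch of such library creation follows, assuming each sound cue is summarized by two coarse waveform characteristics (dominant frequency and RMS amplitude) and stored in a plain dictionary together with its action rule; the recordings, labels, and action-rule names are placeholders, not values from the disclosure.

```python
# Sketch of building a sound-cue library from recorded sounds identified as cues.

import numpy as np


def characteristics(samples, sample_rate):
    """Extract coarse sound characteristics from a mono waveform."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return {
        "dominant_hz": float(freqs[np.argmax(spectrum)]),
        "rms": float(np.sqrt(np.mean(samples ** 2))),
    }


def build_cue_library(recordings, sample_rate):
    """recordings: list of (label, samples, action_rule) identified as sound cues."""
    library = {}
    for label, samples, action_rule in recordings:
        library[label] = {
            "characteristics": characteristics(samples, sample_rate),
            "action_rule": action_rule,          # e.g. "stop", "reduce_torque"
        }
    return library


if __name__ == "__main__":
    rate = 16000
    t = np.linspace(0, 1.0, rate, endpoint=False)
    # Placeholder recordings: a 440 Hz tone standing in for an alarm,
    # and broadband noise standing in for breaking glass.
    alarm = 0.5 * np.sin(2 * np.pi * 440 * t)
    glass = np.random.default_rng(0).normal(0, 0.3, rate)
    lib = build_cue_library(
        [("alarm", alarm, "stop"), ("breaking_glass", glass, "reduce_torque")], rate
    )
    print(lib["alarm"]["characteristics"])
```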

[0010] When creating the library, embodiments may employ a variety of different input data to identify one or more of the recorded plurality of sounds as a sound cue. For instance, embodiments may identify sounds as a sound cue based on user input flagging a given sound as a sound cue, context obtained from analyzing non-sound sensor input, and output of a neural network trained to identify sound cues using the recorded plurality of sounds as input.

[0011] According to an embodiment, comparing the detected sound to the library of sound characteristics associated with sound cues utilizes a neural network. Such an embodiment processes the detected sound using a neural network trained to identify one or more characteristics of the detected sound that matches at least one of the sound characteristics associated with the sound cues.

[0012] In another embodiment, comparing the detected sound to the library of sound characteristics associated with sound cues includes identifying a sound characteristic of the detected sound matching a given sound characteristic associated with a given sound cue in the library. In such an embodiment, modifying the motion of the robot includes identifying one or more action rules associated with the given sound cue (the sound cue with a matching sound characteristic) and modifying the motion of the robot to be in accordance with the one or more action rules.

[0013] In an embodiment, the one or more action rules may dictate the operation of the robot given the sound cue. Similarly, the action rules may dictate the operation of the robot given the sound cue and the context of the robot. Further, the one or more action rules associated with the given sound cue may be a set of action rules. Further still, in an embodiment, at least one of the one or more action rules dictates a first result for the motion of the robot and a second result for the motion of the robot, where the motion of the robot is modified to be in accordance with the first result or the second result based upon the context of the robot.

[0014] Embodiments may treat any sound as a sound cue. For instance, the sound cues may include at least one of: a keyword, a phrase, a sound indicating a safety-relevant, e.g., dangerous event, and a sound relevant to an action. Simply, an embodiment may treat any sound relevant to operation of a robot as a sound cue.

[0015] In embodiments, “context” may include any conditions related in any way to the robot. For example, context may include any data related to the robot, the task performed by the robot, the motion of the robot, and the environment in which the robot is operating, amongst other examples. In embodiments, context of the robot includes at least one of: torque of a joint of the robot; velocity of a link of the robot; acceleration of a link of the robot; jerk of a link of the robot; force of an end effector attached to the robot; torque of an end effector attached to the robot; pressure of an end effector attached to the robot; velocity of an end effector attached to the robot; acceleration of an end effector attached to the robot; task performed by the robot; and characteristics of an environment in which the robot is operating. Further, the context may include any context data as described in U.S. Patent Application titled “Controlling Robot Torque And Velocity Based On Context”, Attorney Docket No. 5000.1055-001.
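
One possible, purely illustrative way to represent this notion of context as a data structure is sketched below; every field name and unit is an assumption made for the sketch, not a schema required by the disclosure.

```python
# A possible data structure for robot "context" as enumerated above.

from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RobotContext:
    joint_torques_nm: Dict[str, float] = field(default_factory=dict)
    link_velocity_ms: float = 0.0
    link_acceleration_ms2: float = 0.0
    link_jerk_ms3: float = 0.0
    end_effector_force_n: float = 0.0
    end_effector_torque_nm: float = 0.0
    end_effector_pressure_pa: float = 0.0
    end_effector_velocity_ms: float = 0.0
    end_effector_acceleration_ms2: float = 0.0
    task: str = "idle"                    # e.g. "scooping", "cutting", "tool_change"
    environment: str = "unknown"          # e.g. "quick_service_kitchen"


if __name__ == "__main__":
    ctx = RobotContext(task="scooping", environment="quick_service_kitchen")
    print(ctx.task, ctx.link_velocity_ms)
```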

[0016] According to an embodiment, modifying motion of the robot includes comparing the context of the robot to a library of contexts to detect a matching context, identifying one or more action rules associated with the matching context, and modifying the motion of the robot to be in accordance with the one or more action rules. Further, in embodiments, the motion of the robot may be modified as described in U.S. Patent Application titled “Robot Interaction With Human Co-Workers”, Attorney Docket No. 5000.1057-001.

[0017] Yet another embodiment creates the library of contexts. In such an embodiment, the context library is created by recording a plurality of contexts, e.g., data indicating context, in an environment and associating, in computer memory of the library, a respective action rule with one or more of the plurality of recorded contexts. In such an embodiment, the contexts may be recorded using any sensor known in the art that can capture context data, i.e., data relevant to the operation of a robot. For instance, the context data may be recorded using at least one of: a vision sensor, a depth sensor, a torque sensor, and a position sensor, amongst other examples.

[0018] An embodiment that creates the context library may also identify the respective action rule associated with the one or more of the plurality of recorded contexts. In an embodiment, identifying the action rule associated with the recorded contexts includes (1) processing the plurality of recorded contexts to identify at least one of: a pattern in the environment in which the contexts were captured and a condition in the environment in which the contexts were captured and (2) identifying the respective action rule using at least one of the identified pattern and condition. In such an embodiment, processing the plurality of recorded contexts to identify at least one of a pattern and a condition includes at least one of (i) comparing the plurality of recorded contexts to a library of predefined context conditions and (ii) evaluating output of a neural network trained to identify patterns or conditions of a context from the plurality of recorded contexts.

[0019] Another embodiment is directed to a system for modifying motion of a robot. The system includes a processor and a memory with computer code instructions stored thereon. In such an embodiment, the processor and the memory, with the computer code instructions, are configured to cause the system to implement any embodiments described herein.

[0020] Yet another embodiment is directed to a computer program product for modifying motion of a robot. The computer program product comprises a computer-readable medium with computer code instructions stored thereon where, the computer code instructions, when executed by a processor, cause an apparatus associated with the processor to perform any embodiments described herein.

[0021] A method embodiment is provided for defining/recording one or more sound cues, e.g., keywords, phrases, and one or more sound wave characteristics (amplitude, frequency, speed), and defining/recording other sensor data, such as camera data, depth data, and torque measurements. Such an embodiment monitors for this data (sound cues and other sensor data, i.e., context data) in an environment in which a robot is operating. This monitoring detects the data and patterns and/or conditions related to this data. Upon meeting pre-defined conditions related to the measured data in the environment, one or more rules governing the operation of the robot are executed.
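
A hedged sketch of such a monitor-and-react loop appears below; the predicate/rule pairing, the assumed 100 Hz polling rate, and all names are illustrative and do not come from the disclosure.

```python
# Sketch of the monitoring described in this paragraph: sound cues and other sensor
# (context) data are checked against pre-defined conditions, and the associated rule
# is executed when a condition is met.

import time


def monitor(get_sound_features, get_sensor_context, conditions, stop_flag):
    """conditions: list of (predicate(sound, context) -> bool, rule(context) -> None)."""
    while not stop_flag():
        sound = get_sound_features()          # e.g. keyword, amplitude, frequency
        context = get_sensor_context()        # e.g. camera, depth, torque data
        for predicate, rule in conditions:
            if predicate(sound, context):
                rule(context)                 # execute the governing rule
        time.sleep(0.01)                      # assumed 100 Hz monitoring rate


if __name__ == "__main__":
    ticks = iter(range(5))
    stop = lambda: next(ticks, None) is None  # run five iterations, then stop
    conditions = [(lambda s, c: s == "ouch" and c["moving"], lambda c: print("stop robot"))]
    monitor(lambda: "ouch", lambda: {"moving": True}, conditions, stop)
```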

[0022] Another embodiment is directed to a method for monitoring keywords, sound wave profiles, and other sensor data. Such an embodiment monitors speech or sounds and other sensor data, i.e., context data, for at least one of: (i) a pre-defined keyword, (ii) a pre-defined phrase, (iii) a characteristic, and (iv) a data pattern. Upon detecting the pre-defined keyword, phrase, characteristic, and/or other sensor data, i.e., context data, pattern, such an embodiment executes a set of rules and actions that are based on the matched pre-defined keyword, phrase, or sound wave characteristic, or context data pattern. In an embodiment, processing these rules results in identifying changes to robot motion based upon the detected pre-defined keyword, phrase, characteristic, or context data pattern.

BRIEF DESCRIPTION OF THE DRAWINGS

[0023] The foregoing will be apparent from the following more particular description of example embodiments, as illustrated in the accompanying drawings in which like reference characters refer to the same parts throughout the different views. The drawings are not necessarily to scale, emphasis instead being placed upon illustrating embodiments.

[0024] FIG. 1A is a block diagram illustrating an example embodiment of a quick service food environment of embodiments of the present disclosure.

[0025] FIG. 1B is a block diagram illustrating an example embodiment of the present disclosure.

[0026] FIG. 2 is a flowchart depicting a method for modifying motion of a robot according to an embodiment.

[0027] FIG. 3 is a block diagram illustrating an example system in which embodiments may be implemented.

[0028] FIG. 4 is a flowchart of an embodiment for controlling a robot in an environment.

[0029] FIG. 5 is a flowchart of a method for training a model that may be employed in embodiments.

[0030] FIG. 6 depicts a computer network or similar digital processing environment in which embodiments may be implemented.

[0031] FIG. 7 is a diagram of an example internal structure of a computer in the environment of FIG. 6.

DETAILED DESCRIPTION

[0032] A description of example embodiments follows.

[0033] Embodiments provide functionality for modifying motion of a robot. Such functionality can be employed in any variety of environments in which control of robot motion is desired. FIG. 1A illustrates a food preparation environment 100 in which embodiments may be employed.

[0034] Operating a robot in a food preparation environment, such as a quick service restaurant, can be challenging for several reasons. First, the end effectors (e.g., utensils) that the robot uses need to remain clean from contamination. Contamination can include allergens (e.g., peanuts), dietary preferences (e.g., contamination from pork for a vegetarian or kosher customer), dirt/bacteria/viruses, or other non-ingestible materials (e.g., oil, plastic, or particles from the robot itself). Second, the robot should be operated within its design specifications, and not exposed to excessive temperatures or incompatible liquids, without sacrificing cleanliness. Third, the robot should be able to manipulate foodstuffs, which are often fracturable and deformable materials, and further the robot must be able to measure an amount of material controlled by its utensil in order to dispense specific portions. Fourth, the robot should be able to automatically and seamlessly switch utensils (e.g., switch between a ladle and salad tongs). Fifth, the utensils should be adapted to be left in an assigned food container and interchanged with the robot as needed, in situ. Sixth, the interchangeable parts (e.g., utensils) should be washable and dishwasher safe. Seventh, the robot should be able to autonomously generate a task plan and motion plan(s) to assemble all ingredients in a recipe, and execute that plan. Eighth, the robot should be able to modify or stop a motion plan based on detected interference or voice commands to stop or modify the robot’s plan. Ninth, the robot should be able to minimize the applied torque based on safety requirements or the task context or the task parameters (e.g., density and viscosity) of the material to be gathered. Tenth, the system should be able to receive an electronic order from a user, assemble the meal for the user, and place the meal for the user in a designated area for pickup automatically with minimal human involvement.

[0035] FIG. 1A is a block diagram illustrating an example embodiment of a quick service food environment 100 of embodiments of the present disclosure. The quick service food environment 100 includes a food preparation area 102 and a patron area 120.

[0036] The food preparation area 102 includes a plurality of ingredient containers 106a-d each having a particular foodstuff (e.g., lettuce, chicken, cheese, tortilla chips, guacamole, beans, rice, various sauces or dressings, etc.). Each ingredient container 106a-d stores in situ its corresponding ingredients. Utensils 108a-d may be stored in situ in the ingredient containers or in a stand-alone tool rack 109. The utensils 108a-d can be spoons, ladles, tongs, dishers (scoopers), spatulas, or other utensils. Each utensil 108a-e is configured to mate with and disconnect from a tool changer interface 112 of a robot arm 110. While the term utensil is used throughout this application, a person having ordinary skill in the art can recognize that the principles described in relation to utensils can apply in general to end effectors in other contexts (e.g., end effectors for moving fracturable or deformable materials in construction with an excavator or backhoe, etc.); and a robot arm can be replaced with any computer controlled actuatable system which can interact with its environment to manipulate a deformable material. The robot arm 110 includes sensor elements/modules such as stereo vision systems (SVS), 3D vision sensors (e.g., Microsoft Kinect™ or an Intel RealSense™), LIDAR sensors, audio sensors (e.g., microphones), inertial sensors (e.g., inertial measurement unit (IMU), torque sensor, weight sensor, etc.) for sensing aspects of the environment, including pose (i.e., X, Y, Z coordinates and roll, pitch, and yaw angles) of tools for the robot to mate, shape and volume of foodstuffs in ingredient containers, shape and volume of foodstuffs deposited into food assembly container, moving or static obstacles in the environment, etc.

[0037] To initiate an order, a patron in the patron area 120 enters an order 124 in an ordering station 122a-b, which is forwarded to a network 126. Alternatively, a patron on a mobile device 128 can, within or outside of the patron area 120, generate an optional order 132. Regardless of the source of the order, the network 126 forwards the order to a controller 114 of the robot arm 110. The controller generates a task plan 130 for the robot arm 110 to execute.

[0038] The task plan 130 includes a list of motion plans 132a-d for the robot arm 110 to execute. Each motion plan 132a-d is a plan for the robot arm 110 to engage with a respective utensil 108a-e, gather ingredients from the respective ingredient container 106a-d, and empty the utensil 108a-e in an appropriate location of a food assembly container 104 for the patron, which can be a plate, bowl, or other container. The robot arm 110 then returns the utensil 108a-e to its respective ingredient container 106a-d, the tool rack 109, or other location as determined by the task plan 130 or motion plan 132a-d, and releases the utensil 108a-d. The robot arm executes each motion plan 132a-d in a specified order, causing the food to be assembled within the food assembly container 104 in a planned and aesthetic manner.

[0039] Within the above environment, various of the above described problems can be solved. The environment 100 illustrated by FIG. 1A can improve food service to patrons by assembling meals faster, more accurately, and more sanitarily than a human can assemble a meal. Some of the problems described above can be solved in accordance with the disclosure below.

[0040] For instance, operating a robot alongside human co-workers, such as in the quick service restaurant environment 100, can be challenging for a number of reasons. One of the most important reasons is ensuring the safe operation of the robot and properly identifying and reacting to dangerous situations. Existing safety mechanisms either rely on a physical interface (button, switch, etc.), or on a non-contextual sensory data point (e.g., radar detecting human proximity).

[0041] In contrast, FIG. 1B illustrates using an embodiment of the present disclosure to control the robot arm, i.e., robot, 110 in the environment 160 based on context and sound. In a similar environment as FIG. 1A, the robot arm 110 includes an array of several microphones 140a-d that are mounted on the robot arm 110. The microphones 140a-d are configured to detect and record sound waves 142. As the microphones 140a-d record the sound waves 142, the recorded sound data 143 is reported to a controller 114. The sound data 143 can be organized into data from individual microphones as mic data 144a-d. The controller 114 can process the sound data 143 and if a sound cue is detected (e.g., a stop or distress sound, e.g., “ouch”) then the controller 114 can issue a stop command 146. Before issuing the stop command 146, the controller 114 can also consider the context of the sound data 143. For instance, the controller 114 can consider the proximity of the sound waves 142 to the robot arm 110. If, for example, the sound 142 is far from the robot arm 110, the controller 114 would consider this when deciding to issue a stop command. Further, it is noted that the controller 114 is not limited to issuing the stop command 146 and, instead, the controller 114 can issue commands modifying the operation of the robot, such as the robot’s motion, path, speed, and torque, amongst other examples. Further, it is noted that while the microphones 140a-d are depicted as located on the robot arm 110 and the controller 114 is located separately from the robot arm 110, embodiments are not limited to this configuration and sound capturing devices may be in any location. Similarly, the processing performed by the controller 114 may be performed by one or more processing devices that are capable of obtaining and processing sound data and issuing controls for the robot. These processing devices may be located on/in the robot or may be located locally or remotely in relation to the robot arm 110.

[0042] FIG. 1B further illustrates sound waves 150 beginning from the patron area 120. With the multiple microphones 140a-d, the controller 114 can determine a triangulated location 152 of the sound waves 150. In turn, the controller 114 can process the sound waves 150 to determine if the sound waves 150 correspond to a sound cue for which action should be taken and the controller 114 can also consider the context of the robot arm 110, such as the location 152 of the sound waves 150 in relation to the robot arm 110. Based upon the sound waves 150 and the context, the controller 114 can determine modifications, if any, for the robot arm’s 110 motion. In the example of the sound waves 150, the controller 114 can determine that the triangulated location 152 is in the patron area 120 and the controller 114 can consider the proximity of the location 152 to the robot arm 110 and ignore the sound waves 150 altogether even if the sound waves 150 correspond to a sound cue for which action would be taken if the sound cue occurred in closer proximity to the robot arm 110.
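
A rough sketch of one way such a location could be estimated from time differences of arrival across the microphone array is shown below; the microphone positions, the 343 m/s speed of sound, and the brute-force grid-search bounds are assumptions made for illustration, not a description of the controller 114.

```python
# Locating a sound source from time-difference-of-arrival (TDOA) across a small
# microphone array, using a brute-force grid search over a 2D plane.

import numpy as np

SPEED_OF_SOUND = 343.0  # m/s, assumed


def locate_source(mic_positions, arrival_times, bounds=(-5.0, 5.0), step=0.05):
    """Return the (x, y) grid point whose predicted TDOAs best match the measured ones."""
    mics = np.asarray(mic_positions, dtype=float)
    times = np.asarray(arrival_times, dtype=float)
    measured_tdoa = times - times[0]                      # differences vs. reference mic
    grid = np.arange(bounds[0], bounds[1], step)
    best, best_err = None, np.inf
    for x in grid:
        for y in grid:
            dists = np.linalg.norm(mics - np.array([x, y]), axis=1)
            predicted_tdoa = (dists - dists[0]) / SPEED_OF_SOUND
            err = np.sum((predicted_tdoa - measured_tdoa) ** 2)
            if err < best_err:
                best, best_err = (x, y), err
    return best


if __name__ == "__main__":
    mics = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5)]   # array on the robot arm
    source = np.array([2.0, 3.0])
    true_times = [np.linalg.norm(source - np.array(m)) / SPEED_OF_SOUND for m in mics]
    print(locate_source(mics, true_times))   # approximately (2.0, 3.0)
```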

[0043] FIG. 2 is a flow chart of a method 220 for modifying motion of a robot according to an embodiment. The method 220, at 221, detects a sound in an environment using a sound capturing device. In an embodiment, the method 220 continuously operates during conditions in which the robot is configured to move and the robot can possibly collide with objects. In turn, the method 220 processes, at 222, the sound detected from 221. The processing 222 determines whether the detected sound is a sound for which action should be taken. According to an embodiment, the processing 222 includes at least one of: (1) comparing the detected sound to a library of sound characteristics associated with sound cues and (2) extracting features or characteristics from the detected sound using a model. In an embodiment, the comparing at 222 using a model is done via a neural network serving as the model. To continue the method 220, at 223, motion of a robot is modified based on a context of the robot and at least one of: (i) the comparison, (ii) the features extracted from the detected sound, and (iii) the characteristics extracted from the detected sound. The motion modification can take any form, such as the motion modification described in U.S. Patent Application titled “Robot Interaction With Human Co-Workers”, Attorney Docket No. 5000.1057-001, including moving to a known safe region, stopping all motion, or using the sound to apply additional context to the current action. If the current action context is dangerous, then a triggering sound cue may be configured to drop all robot joint torques below a safe threshold until a human operator signals that it is safe to continue robot operation.

[0044] To illustrate the method 220, consider the example environment 160 depicted in FIG. 1B. In such an example embodiment, at 221, the sound waves 142 are detected by the microphones 140a-d and recorded as the sound data 143. In turn, at 222, the sound data 143 is processed by the controller 114 which compares the sound data 143 to a library. In such an example, the comparison identifies that the sound data 143 matches the sound cue of a person yelling stop. At 223, based on the context, which in this example is a person’s hand approaching the food preparation area 102, and the comparison determining that the recorded sound data 143 matches the person yelling the “stop” sound cue, the controller 114 determines that the robot should be stopped and issues the stop command 146.

[0045] An embodiment of the method 220 creates the library of sound characteristics associated with the sound cues used at 222. Such an embodiment creates the library by (1) recording a plurality of sounds in an environment, (2) identifying one or more of the recorded plurality of sounds as a sound cue, (3) determining sound characteristics of the one or more plurality of sounds identified as a sound cue, (4) associating the determined sound characteristics with the one or more plurality of sounds identified as a sound cue in computer memory of the library, and (5) associating, in the computer memory of the library, a respective action rule with the one or more plurality of sounds identified as a sound cue.

[0046] According to an embodiment, creating the library as described trains a neural network, e.g., model, using the action rules associated with the plurality of sounds identified as a sound cue. As such, a neural network may be created that can receive a sound recorded in an environment and determine an appropriate action rule to be executed. In such an embodiment, sound cues may be labeled by what the sounds indicate, e.g., collisions, broken plate, etc. As such, the sound cues may be labeled with a classification of the sound. Further, sound cues may also be associated with the context data of the conditions under which the sounds were recorded, e.g., location. In such an embodiment, the library may associate, in the computer memory, an action rule with the sounds identified as a sound cue and the relevant context data. This data (the sound characteristics of a sound cue, the context data, and the action rule(s)) may be used to train a neural network and, thus, the trained neural network can identify action rules to execute given input sound data and context data.

[0047] Action rules may indicate any action given associated conditions, e.g., sound and context. To illustrate, one action rule may indicate that if the detected sound is “ouch” and the context is the robot moving (likely the robot hit someone), the resulting action should be stopping the robot’s motion. Another action rule may indicate that if the detected sound is “ouch” and the context is the robot being stopped and exerting a torque (likely indicating that the robot pinned a person), the robot’s motion should be changed to zero torque.
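
These example rules could be encoded, for instance, as a small decision table; the table layout, the context keys, and the command names below are assumptions made for illustration only.

```python
# One way to encode the "ouch" rules from this example as a small decision table.

ACTION_RULES = [
    # (sound cue, context predicate, resulting action)
    ("ouch", lambda ctx: ctx["moving"], "stop_motion"),
    ("ouch", lambda ctx: not ctx["moving"] and ctx["torque_nm"] > 0.0, "zero_torque"),
]


def select_action(cue, context):
    """Return the first action whose cue and context predicate both match."""
    for rule_cue, predicate, action in ACTION_RULES:
        if cue == rule_cue and predicate(context):
            return action
    return "continue"                       # no rule matched; keep current plan


if __name__ == "__main__":
    print(select_action("ouch", {"moving": True, "torque_nm": 5.0}))    # stop_motion
    print(select_action("ouch", {"moving": False, "torque_nm": 5.0}))   # zero_torque
```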

[0048] When creating the library of sound characteristics embodiments may employ a variety of different input data to identify one or more of the recorded plurality of sounds as a sound cue. For instance, embodiments may identify sounds as a sound cue based on (i) a user input flagging a given sound as a sound cue, (ii) context obtained from analyzing non-sound sensor input, and (iii) output of a machine learning method as described herein, such as the method 550. Such functionality may employ a neural network trained to identify sound cues using the recorded plurality of sounds as input. In an embodiment, the non-sound sensor may be any sensor known in the art, such as a camera, torque sensor, and force sensor, amongst other examples.

[0049] Further, an embodiment, may identify a recorded sound as a sound cue using a neural network trained to identify a recorded sound as a sound cue based on the recorded sound and the non-sound sensor context data. To illustrate, consider an example where a recorded sound is a collision. The collision itself is identifiable in an image (non-sound sensor input). A neural network can be trained to identify the sound (the collision) as a sound cue based upon input of the image showing the collision.

[0050] In the example of identifying a sound as a sound cue from context obtained from analyzing non-sound sensor input, this non-sound sensor may be any sensor known in the art, such as a camera, depth sensor, torque sensor, lidar, thermometer, and pressure sensor, amongst other examples. To illustrate, consider an example where, as part of creating the library, the sound of glass breaking is recorded. In such an embodiment, context data can be obtained using image data from a camera which indicates that the robot collided with a glass object and broke the glass object. As such, it can be determined that the recorded sound of glass breaking should be a sound cue because the image showed the robot breaking the glass object, and the sound can be stored accordingly in the library.

[0051] According to an embodiment, comparing the detected sound to the library of sound characteristics associated with sound cues at 222 utilizes a neural network. Such functionality may utilize any neural network described herein. Further, such an embodiment processes the detected sound using a neural network trained to identify one or more characteristics of the detected sound that matches at least one of the sound characteristics associated with the sound cues. In an embodiment, 222 may utilize a model that is a neural network classifier that characterizes the detected sound. Such an embodiment may simply determine if a sound is “bad” or “not bad” and, in turn, at 223, the robot’s motion is modified based upon context and whether the sound is “bad” or “not bad”. In such an embodiment, the neural network may be implemented using supervised learning where the neural network is trained with sound examples that have been labelled as “bad” or “not bad.”
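
A minimal sketch of such a supervised “bad”/“not bad” classifier, using a small scikit-learn multi-layer perceptron on two coarse waveform features and synthetic placeholder data, follows; the feature choices, network size, and training data are assumptions and do not come from the disclosure.

```python
# Sketch of a supervised "bad" / "not bad" sound classifier on coarse waveform features.

import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def waveform_features(samples, sample_rate):
    """Return [dominant frequency in Hz, RMS amplitude] for a mono waveform."""
    spectrum = np.abs(np.fft.rfft(samples))
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / sample_rate)
    return [float(freqs[np.argmax(spectrum)]), float(np.sqrt(np.mean(samples ** 2)))]


if __name__ == "__main__":
    rate, n = 16000, 4000
    rng = np.random.default_rng(1)
    t = np.arange(n) / rate
    # Placeholder data: loud broadband noise labelled "bad" (1), quiet tones "not bad" (0).
    bad = [rng.normal(0.0, 0.8, n) for _ in range(20)]
    not_bad = [0.1 * np.sin(2 * np.pi * rng.uniform(200, 800) * t) for _ in range(20)]
    X = np.array([waveform_features(s, rate) for s in bad + not_bad])
    y = np.array([1] * len(bad) + [0] * len(not_bad))
    clf = make_pipeline(StandardScaler(), MLPClassifier((16,), max_iter=2000, random_state=0))
    clf.fit(X, y)
    print(clf.predict([waveform_features(rng.normal(0.0, 0.8, n), rate)]))   # likely [1]
```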

[0052] In another embodiment of the method 220, comparing the detected sound to the library of sound characteristics associated with sound cues includes identifying a sound characteristic of the detected sound matching a given sound characteristic associated with a given sound cue in the library. According to an embodiment, the matching is determined through a tuned threshold which is selective to avoid false positives, but meets required levels of safety in conjunction with safe operation, such as working with human co-workers as described in U.S. Patent Application titled “Robot Interaction With Human Co-Workers”, Attorney Docket No. 5000.1057-001 and utilizing safe torques as described in U.S. Patent Application titled “Controlling Robot Torque And Velocity Based On Context”, Attorney Docket No. 5000.1055-001. In such an embodiment, the sound characteristic may be any characteristic of a sound wave known in the art, such as frequency, amplitude, direction, and velocity.
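
For illustration, such threshold-based matching of sound characteristics might look like the following sketch, where the per-characteristic tolerances and the threshold of 1.0 are stand-ins for tuned values rather than values from the disclosure.

```python
# Matching a detected sound's characteristics against a library cue with a tuned threshold.

def matches(detected, cue, tolerances, threshold=1.0):
    """detected/cue: dicts of characteristics, e.g. {"dominant_hz": ..., "rms": ...}."""
    score = sum(((detected[k] - cue[k]) / tolerances[k]) ** 2 for k in tolerances)
    return score < threshold


if __name__ == "__main__":
    glass_cue = {"dominant_hz": 4200.0, "rms": 0.6}
    tolerances = {"dominant_hz": 500.0, "rms": 0.3}
    print(matches({"dominant_hz": 4000.0, "rms": 0.55}, glass_cue, tolerances))  # True
    print(matches({"dominant_hz": 900.0, "rms": 0.1}, glass_cue, tolerances))    # False
```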

[0053] In an embodiment, modifying the motion of the robot 223 is based upon the result of the comparison, the features extracted, and/or the characteristics extracted at 222. To illustrate, in the example where the comparing 222 includes identifying a sound characteristic of the detected sound that matches a given sound characteristic associated with a given sound cue in the library, modifying the motion of the robot 223 includes identifying one or more action rules associated with the given sound cue (the sound cue with a sound characteristic that matches a sound characteristic of the recorded sound). In such an embodiment, the robot motion is modified to be in accordance with the one or more action rules.

[0054] Similarly, the modifying 223 may be done in accordance with the extracted features or characteristics. For instance, if a feature is extracted which simply indicates the detected sound is a “bad sound,” e.g., associated with injury, the robot may be stopped when the feature is extracted from the detected sound. In an embodiment, sounds which are encoded in the library as being associated with dangerous situations for the human or the co-worker are used to modify the context of the executed action to either stop or change the motion of the robot as described in U.S. Patent Application titled “Controlling Robot Torque And Velocity Based On Context”, Attorney Docket No. 5000.1055-001 (torque based on context) and U.S. Patent Application titled “Robot Interaction With Human Co-Workers”, Attorney Docket No. 5000.1057-001 (working with human co-workers).

[0055] In embodiments of the method 220, the one or more action rules may dictate the operation of the robot given the sound cue. Further, the action rules may dictate the operation of the robot given the sound cue and the context of the robot. Further, the one or more action rules associated with the given sound cue may be a set of action rules. These rules may indicate different actions to take based upon different characteristics of a recorded sound and different context data of the environment in which the sound was recorded.

[0056] In embodiments, the set of rules may be based upon different sounds, characteristics of sounds, classification of sounds, classification of characteristics of sounds, and context of sounds, e.g., location of sounds. For instance, in an embodiment, at least one of the one or more action rules dictates a first result for the motion of the robot and a second result for the motion of the robot, where the motion of the robot is modified to be in accordance with the first result or the second result based upon the context of the robot. To illustrate, again consider the example where the detected sound is glass breaking. After detecting this sound and comparing the detected sound to the library of sound characteristics associated with sound cues, it is determined that the detected sound has characteristics matching the “breaking glass” sound cue. The breaking glass sound cue has action rules which dictate a result based on context. For example, the rules may indicate that the robot’s motion should stop if the broken glass sound occurred within 10 feet of the robot and the robot can operate normally if the broken glass sound occurred more than 10 feet from the robot.

[0057] Embodiments of the method 220 may treat any sound as a sound cue. For instance, the sound cues may include at least one of: a keyword, a phrase, a sound indicating a safety-relevant, e.g., dangerous, event, and a sound relevant to an action. Simply, embodiments may treat any sound relevant to operation of a robot as a sound cue.

[0058] In embodiments of the method 220, “context” may include any conditions related, in any way, to the robot, such as environmental context and operational context. For example, context may include any data related to the robot, the task performed by the robot, the motion of the robot, and the environment in which the robot is operating, amongst other examples. For instance, in embodiments, context of the robot includes at least one of: torque of a joint of the robot; velocity of a link of the robot; acceleration of a link of the robot; jerk of a link of the robot; force of an end effector attached to the robot; torque of an end effector attached to the robot; pressure of an end effector attached to the robot; velocity of an end effector attached to the robot; acceleration of an end effector attached to the robot; task performed by the robot; and characteristics of an environment in which the robot is operating. Further, context may include an action state of the robot (e.g., idle, moving, changing tool, scooping, cutting, picking) and a state (e.g., in workspace, speed of movement, not in workspace, proximity, can collide, unable to collide) of objects (humans, robots, animals, etc.). Further, the context may include any context data as described in U.S. Patent Application titled “Controlling Robot Torque And Velocity Based On Context”, Attorney Docket No. 5000.1055-001 and the context may include predicted motion of an object as described in U.S. Patent Application titled “Robot Interaction With Human Co-Workers”, Attorney Docket No. 5000.1057-001. By knowing the context of the action and thus the implied level of danger associated with it, the level of reaction to the sound cue can be modified. For example, if the robot is engaged in a dangerous activity which requires high torque and a sharp object, then any sound cue indicating distress results in an immediate and drastic reduction in robot output torque to below a safe threshold.

[0059] In an embodiment of the method 220, modifying motion of the robot at 223 includes comparing the context, i.e., context data, of the robot to a library of contexts, i.e., context data, to detect a matching context. Such an embodiment identifies one or more action rules associated with the matching context and modifies the motion of the robot to be in accordance with the one or more action rules. Comparing the context to a library of contexts may be done by a neural network or by comparing features of the context of the robot to features of the contexts in the library. In this way, embodiments may utilize statistical models.
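
A simple feature-comparison variant of this context matching is sketched below as a nearest-neighbor lookup over a small context library; the feature names, library entries, and rule names are illustrative placeholders, not part of the disclosure.

```python
# Matching the robot's current context against a context library by feature distance.

import math


def nearest_context(current, library):
    """library: dict name -> {"features": {...}, "action_rules": [...]}."""
    def distance(a, b):
        keys = set(a) | set(b)
        return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

    return min(library.items(), key=lambda item: distance(current, item[1]["features"]))


if __name__ == "__main__":
    library = {
        "scooping_near_human": {
            "features": {"joint_torque_nm": 12.0, "link_velocity_ms": 0.2, "human_distance_m": 0.5},
            "action_rules": ["limit_torque", "reduce_speed"],
        },
        "idle": {
            "features": {"joint_torque_nm": 0.0, "link_velocity_ms": 0.0, "human_distance_m": 3.0},
            "action_rules": ["continue"],
        },
    }
    current = {"joint_torque_nm": 11.0, "link_velocity_ms": 0.25, "human_distance_m": 0.6}
    name, entry = nearest_context(current, library)
    print(name, entry["action_rules"])   # scooping_near_human ['limit_torque', 'reduce_speed']
```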

[0060] Yet another embodiment of the method 220 creates the library of contexts. In such an embodiment, the context library is created by recording a plurality of contexts, i.e., recording data indicating contexts, in an environment and associating, in computer memory of the library, a respective action rule with one or more of the plurality of recorded contexts. In such an embodiment, the contexts may be recorded using any sensor known in the art that can capture context data, i.e., data relevant to the operation of a robot. For instance, the context data may be recorded using at least one of: a vision sensor, a depth sensor, a torque sensor, and a position sensor, amongst other examples.

[0061] An embodiment of the method 220 that creates the context library may also identify the respective action rule associated with the one or more of the plurality of recorded contexts. In an embodiment, identifying the action rule associated with the recorded contexts includes (1) processing the plurality of recorded contexts to identify at least one of: a pattern in the environment in which the contexts were captured and a condition in the environment in which the contexts were captured. In turn, the respective action rule is identified using at least one of the identified pattern and condition. In such an embodiment, processing the plurality of recorded contexts to identify at least one of a pattern and a condition includes at least one of (i) comparing the plurality of recorded contexts to a library of predefined context conditions and (ii) evaluating output of a neural network trained to identify patterns or conditions of a context from the plurality of recorded contexts. Such an embodiment may apply a modification to the technique described in U.S. Patent Application titled “Controlling Robot Torque And Velocity Based On Context”, Attorney Docket No. 5000.1055-001 (controlling torque based on context) where sounds are matched to an action context. In future execution, whenever a sound of that type is detected, it can be used to update and modify the current robot context.

[0062] Embodiments can use a neural network architecture to implement the various functionalities described herein. For instance, embodiments can utilize a convolutional neural network (CNN), a fully convolutional neural network (FCN), a recurrent neural network (RNN), a long short-term memory (LSTM) neural network, or any other known neural network architecture. In embodiments, any data described herein, e.g., sound and context data, or a combination thereof, can be used to train such a neural network. In an embodiment, a neural network is trained according to methods known to those skilled in the art. According to an embodiment, a neural network which determines a robot’s reaction based on a given context is trained by using the additional information provided by the detected sounds. Additionally, in an embodiment the sound neural network can be informed by the current context and action of the robot. For example, if the robot is handling pots and pans, the clanging and banging noises associated with that motion are indicative of normal operation. In contrast, a detected clanging or banging while preparing a stir-fry in a wok is likely to be indicative of a problem.
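
One possible way to let a sound network be informed by context, as suggested here, is to fuse sound features with context features before classification; the PyTorch sketch below is an assumption about one such architecture, with arbitrary layer sizes, feature dimensions, and class labels.

```python
# Sketch of a network that fuses sound features with robot-context features.

import torch
from torch import nn


class SoundWithContextNet(nn.Module):
    def __init__(self, sound_dim=40, context_dim=8, classes=3):
        super().__init__()
        self.sound = nn.Sequential(nn.Linear(sound_dim, 32), nn.ReLU())
        self.context = nn.Sequential(nn.Linear(context_dim, 16), nn.ReLU())
        self.head = nn.Linear(32 + 16, classes)   # e.g. normal / warning / danger

    def forward(self, sound_features, context_features):
        fused = torch.cat([self.sound(sound_features), self.context(context_features)], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    net = SoundWithContextNet()
    logits = net(torch.randn(2, 40), torch.randn(2, 8))
    print(logits.shape)   # torch.Size([2, 3])
```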

[0063] FIG. 3 is a block diagram illustrating an example system 330 in which embodiments may be implemented. The system 330 comprises a computer 331, having input and output ports. The computer 331 is suitable for running software capable of running a keyword or phrase matching program, a sound wave characteristic matching program, a multi-variable pattern recognition program, and a robot controlling system, as well as other operating systems. In embodiments, the computer 331 may be any processing device known in the art such as a personal computer or a processor complex.

[0064] The computer 331 is connected to an input device 332. In embodiments, the input device 332 can be a microphone which allows a user to record a digital voice print to customize the system 330 to detect voice commands and accordingly perform a set of rules. In embodiments, the input device 332 can be used to load a set of keywords or phrases into a database 333. The input device 332 can also be used to record or load a set of sound wave characteristics (e.g., digitization of the sound of glass breaking) into the database 333.

[0065] The computer 331 is communicatively coupled to the database 333 which can contain a preset or continuously changing set of keywords, phrases, sound wave characteristics, or other sensor data. Database 333 can also be a trained neural network, trained model, or a heuristic model.

[0066] The computer 331 is also connected to a sensor 334. The sensor 334 provides contextual information to the computer 331, and can affect the rules that the system 330 executes. In embodiments, the sensor 334 may be a camera capturing a real-time feed of an environment. In embodiments, the sensor 334 may be a torque measurement device connected to the robot 335. Further, in embodiments, the sensor 334 may be a collection of cameras, torque measurement devices, and other sensors and measurement devices. The sensor 334 produces a data feed 336 which is a collection of data points coming from the variety of input sensors 334. Further, while not depicted in FIG. 3, the computer 331 may also issue commands/controls to the sensor 334.

[0067] In the system 330, the computer 331 is connected to the microphone 337 (which may be an array of audio capture devices). The microphone 337 captures sound data and relays it via data stream 338 to the computer 331.

[0068] In an embodiment, the computer 331 compares incoming audio signals from the microphone 337 to a database of sounds 333, and performs a set of predefined rules based on the comparison. The comparison can be made by matching sound wave components from data stream 338 against a library or model of known sound wave fingerprints in the database 333, or by matching a keyword or phrase against a library of pre-defined keywords or phrases in the database 333. The comparison can be made using a Bayesian estimator, a convolutional neural network, or a recurrent neural network. In an embodiment, the comparison generates a confidence indicating whether an alert should be triggered, i.e., whether motion of the robot should be modified. In embodiments, a variety of threshold functions can be used to determine if a recorded sound should be acted upon (e.g., a single threshold value, above a threshold for a period of time, or some other function of time, confidence, and other signals in the environment).
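
The "above a threshold for a period of time" variant mentioned here could be implemented, for example, as a small stateful trigger; the 0.8 confidence level and 0.3 second hold time below are illustrative tuning values and are not taken from the disclosure.

```python
# A trigger that fires only when match confidence stays above a level for a duration.

class SustainedConfidenceTrigger:
    def __init__(self, level=0.8, hold_seconds=0.3):
        self.level = level
        self.hold_seconds = hold_seconds
        self._above_since = None

    def update(self, confidence, timestamp):
        """Feed one (confidence, time) sample; return True when the alert should fire."""
        if confidence < self.level:
            self._above_since = None          # dropped below the level; reset the timer
            return False
        if self._above_since is None:
            self._above_since = timestamp
        return (timestamp - self._above_since) >= self.hold_seconds


if __name__ == "__main__":
    trigger = SustainedConfidenceTrigger()
    samples = [(0.0, 0.9), (0.1, 0.95), (0.2, 0.4), (0.3, 0.9), (0.5, 0.92), (0.7, 0.9)]
    for t, conf in samples:
        print(t, trigger.update(conf, t))     # fires only at t = 0.7
```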

[0069] The computer 331 may control the robot 335, based on a set of rules related to the comparison performed on the aggregation of data streams 338 (sound data) and 336 (context data), and other inputs. In an embodiment, other inputs (data in addition to the data from the microphone 337 and sensor 334) can be provided by the robot 335 to the computer 331. The computer 331 may also control output on an external display 339 such as a monitor. In an embodiment, the display 339 alerts a user whenever the system detects danger.

[0070] FIG. 4 is a flowchart of a method embodiment 440 for controlling a robot in an environment. The method 440 may be implemented using computer program code in combination with one or more hardware devices. The computer program code may be stored on storage media, or may be transferred to a workstation over the Internet or some other type of network for execution.

[0071] The method 440 starts 441 and at 442, sound capturing devices are connected to the system. The sound capturing devices can be a microphone or an array of microphones or any other sound capturing device known in the art. In embodiments, an array of microphones is used to detect the source of a sound. At 443 additional sensors are connected to the system. These additional sensors can include cameras, depth sensors, sonars, and force torque sensors, amongst other examples. In embodiments, these sensors provide context data related to the environment in which the robot being controlled is operating. This context information can include the nature of the surroundings or the actions performed by an object in the environment.

[0072] At 444, a keyword and sound database is loaded that is indexed and searchable by different parameters such as keywords and sound characteristics, e.g., frequencies. This database can be built through use of computer software that copies pre-defined keywords, phrases, sound wave characteristics, and other data metrics, by a live recording of sounds or keywords narrated by a human speaker, or through any other simulation of the data source, i.e., an environment in which a robot is being controlled. The database may also be dynamically updated based on self-generated feedback or manually using input feedback provided by a user to a particular recording.
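
A minimal sketch of such an indexed database is shown below, assuming a simple in-memory structure keyed by keyword and by a coarse sound characteristic (a dominant frequency). The class names, fields, and example entries are hypothetical and serve only to illustrate indexing by keywords and sound characteristics.

    # Hypothetical in-memory "keyword and sound database", indexed by keyword
    # and by a coarse spectral characteristic. Entries are illustrative only.
    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class SoundEntry:
        label: str            # e.g., "glass_break"
        dominant_hz: float    # coarse frequency fingerprint
        rule: str             # action rule associated with the cue

    @dataclass
    class CueDatabase:
        keywords: Dict[str, str] = field(default_factory=dict)   # phrase -> rule
        sounds: List[SoundEntry] = field(default_factory=list)

        def lookup_keyword(self, phrase: str) -> Optional[str]:
            return self.keywords.get(phrase.lower())

        def lookup_sound(self, dominant_hz: float, tol_hz: float = 200.0) -> List[SoundEntry]:
            return [e for e in self.sounds if abs(e.dominant_hz - dominant_hz) <= tol_hz]

    db = CueDatabase(
        keywords={"stop": "halt_motion", "ouch": "halt_motion"},
        sounds=[SoundEntry("glass_break", 4200.0, "halt_motion")],
    )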

[0073] At 445, a set of rules is defined and associated with different data values, e.g., sounds and keywords, and other cues such as environmental context or input from other sensory devices. The rules can be pre-defined and copied via computer software, or can be changed dynamically based on input. For example, user input may be used to customize the rules for the operating environment. The rules may also be changed dynamically based on feedback captured by a system implementing the method 440.
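
One possible way to represent such rules is sketched below; the rule fields, the example rules, and the context dictionary keys are assumptions made for illustration, not the rule format used by the disclosed system.

    # Hypothetical rule definitions tying cues and context to robot actions.
    from dataclasses import dataclass
    from typing import Callable, Dict, List

    @dataclass
    class Rule:
        cue: str                              # e.g., "stop" or "glass_break"
        requires_motion: bool                 # only fire if the robot is moving
        action: str                           # e.g., "halt" or "reduce_speed"
        context_check: Callable[[Dict], bool] = lambda ctx: True

    RULES: List[Rule] = [
        Rule(cue="stop", requires_motion=True, action="halt"),
        Rule(cue="glass_break", requires_motion=True, action="reduce_speed",
             context_check=lambda ctx: ctx.get("human_nearby", False)),
    ]

    def applicable_rules(cue: str, context: Dict) -> List[Rule]:
        """Rules matching the cue whose motion and context conditions are satisfied."""
        return [r for r in RULES
                if r.cue == cue
                and (not r.requires_motion or context.get("robot_moving", False))
                and r.context_check(context)]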

[0074] At 446 the robot is connected to the system. The robot may provide information, such as motion data, image capture data, or other sensor output. The robot may also be commanded by a system implementing the method 440 to modify its operation. These modifications may include reducing the robot’s speed, modifying the robot’s movement plan, or completely stopping.

[0075] At 447, the robot performs its predefined actions. The robot performing these actions can be implemented as part of software implementing the method 440 or these actions can be dictated by an independent software program.

[0076] At 448, sounds and other data inputs are monitored and processed. The processing can be an aggregation of sound data and context data, or the execution of other mathematical functions on sound and context data. For instance, an embodiment can utilize mathematical functions to perform preprocessing, filtering, data shaping, feature extraction, classification, and matching of the sound and context data. At 449, a check is made whether one or more of these data points, a collection of these data points, or a pattern of these data points meets one or more conditions associated with the database or model loaded at 444. The check 449 may involve receiving words, phrases, sounds, and other inputs from a system that processes this data to remove noise or perform other mathematical transformations. If no condition is met, then the monitoring process continues at 448. However, if a match is detected, then at 450, a rule or rules are processed and executed. These rules can be executed by the robot, as shown in flow 451. Optionally, the rules can be dynamically updated based on the fact that the rule or rules have been executed. Optionally, the database and model loaded at 444 can be updated based on the fact that the rule or rules have been executed. At 452, a check is made whether the rule or rules require any human intervention. If no human intervention is needed, then the monitoring process continues at 448. However, if human intervention is needed, then at 453 the human provides the input. This input can be a physical input, such as pushing a button or a switch, or a digital input, such as pushing a button on a computer display screen. Optionally, the human input can be sent as a command to the robot, as shown in flow 454. After the human provides the input, the monitoring process continues at 448.
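
The structure of this monitor/match/execute loop can be sketched as follows. The helper callables (reading inputs, matching against the database, executing rules, and collecting human input) are passed in as parameters because their implementations are system-specific; every name here is a placeholder rather than part of the disclosed method.

    # Structural sketch of steps 448-454; every callable is a placeholder
    # supplied by the surrounding system.
    def monitoring_loop(read_inputs, match, execute, needs_human,
                        get_human_input, send_to_robot):
        while True:
            sound, context = read_inputs()        # 448: aggregate sound + context data
            rule = match(sound, context)          # 449: compare against database/model
            if rule is None:
                continue                          # no condition met: keep monitoring
            execute(rule)                         # 450/451: rule executed (by the robot)
            if needs_human(rule):                 # 452: human intervention required?
                command = get_human_input()       # 453: physical or digital input
                if command is not None:
                    send_to_robot(command)        # 454: forward the input to the robot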

[0077] FIG. 5 is a flowchart of a method 550 for training a model, i.e., a deep neural network (DNN) that may be employed in embodiments to recognize sound cues or to extract sound features and characteristics. In an embodiment, the audio is preprocessed to extract features suitable to be fed into the DNN. Feature extraction can be done using Mel-Frequency Cepstral Coefficients or other spectral analysis methods. The extracted features are fed into a neural network model such as a convolutional neural network (CNN), or into a support vector machine (SVM), or into another machine learning technique. A convolutional neural network consists of a combination of convolutional layers, max pooling layers and fully connected dense layers. The final layer is used for classifying the original sound cue using, for example, a softmax function or a mixture of softmaxes (MoS). The method 550 may be implemented using computer program code in combination with one or more hardware devices. The computer program code may be stored on storage media, or may be transferred to a workstation for execution over the Internet or any type of network.
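
A rough sketch of this pipeline is given below, assuming librosa for MFCC extraction and PyTorch for the network. The layer sizes, the number of MFCC coefficients, and the class count are illustrative choices, not the configuration of the disclosed model.

    # Sketch only: MFCC features fed into a small CNN (convolution, max pooling,
    # dense layers) whose output is turned into class probabilities with softmax.
    import librosa
    import numpy as np
    import torch
    import torch.nn as nn

    def extract_mfcc(waveform: np.ndarray, sample_rate: int, n_mfcc: int = 20) -> torch.Tensor:
        mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=n_mfcc)
        # Shape (1, 1, n_mfcc, frames): batch and channel dimensions for the CNN.
        return torch.from_numpy(mfcc).float().unsqueeze(0).unsqueeze(0)

    class SoundCueCNN(nn.Module):
        def __init__(self, n_classes: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)),
            )
            self.classifier = nn.Sequential(
                nn.Flatten(),
                nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
                nn.Linear(64, n_classes),   # final layer; softmax applied at inference
            )

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.classifier(self.features(x))

    # Inference: probabilities over sound-cue classes for a detected sound.
    # probs = torch.softmax(SoundCueCNN(n_classes=5)(extract_mfcc(wave, 16000)), dim=-1)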

[0078] The method 550 starts 551 and at 552 sound capturing devices are connected to a system executing the method 550. The sound capturing devices can be a microphone or an array of microphones or any other sound capturing device known in the art. In embodiments, an array of microphones connected at 552 is used to detect a sound and the source of a sound. At 553 other sensors are connected to the system. These other sensors can include cameras, depth sensors, sonars, and force torque sensors, amongst other examples. In embodiments, these sensors provide context data, such as the nature of the surroundings or the actions performed in an environment.

[0079] At 554 a robot is connected to the system executing the method 550. The robot may provide information to the overall system, such as motion data, image capture data, or other sensor output. The robot may also be commanded by a system implementing the method 550. The robot may be commanded to reduce its speed, to completely stop, or to execute actions to record and generate additional data.

[0080] At 555, sounds (from the devices connected at 552) and other data inputs such as camera feeds and torque information (from the devices connected at 553) are measured and processed. The processing can be an aggregation of this data, or the execution of other mathematical functions on this data. At 556 this data is recorded and stored in a database. This database can be indexed and searchable.

[0081] At 557, rules are defined and associated with one or more data entries or data patterns from the database. These rules can be actions to be executed by the robot, such as stopping the robot or reducing the speed of the robot’s motions.

[0082] An embodiment provides a sound-based emergency stop method to stop robot motion without a physical interface (button, switch, etc.). Such an embodiment listens for a variety of sounds which indicate a human, distress, a human command, or a mechanical impact or failure. An audio signal, received/recorded via a microphone or an array of microphones, is compared to a library of sounds (e.g., verbal cues, such as "stop" or "ouch," and non-verbal cues, such as the sound of glass breaking, or impact between two rigid objects). The comparison can be done using a voice or acoustic model. Further, the comparison can be made by matching sound characteristics, e.g., frequency components, against a library or model of known frequency fingerprints using (i) a Bayesian estimator, (ii) a convolutional neural network, or (iii) a recurrent neural network.
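
As one simplified illustration of frequency-fingerprint matching (not the Bayesian or neural comparison named above), the sketch below compares a normalized magnitude spectrum of the detected sound against stored fingerprints using cosine similarity. It assumes fixed-length analysis windows so that spectra are comparable; all names are illustrative.

    # Simplified fingerprint comparison; assumes all waveforms use the same
    # fixed-length analysis window so their spectra line up.
    import numpy as np

    def spectrum(waveform: np.ndarray) -> np.ndarray:
        mag = np.abs(np.fft.rfft(waveform))
        return mag / (np.linalg.norm(mag) + 1e-12)   # unit-norm magnitude spectrum

    def best_match(waveform: np.ndarray, library: dict) -> tuple:
        """Return (cue_name, similarity) of the closest stored fingerprint."""
        s = spectrum(waveform)
        scores = {name: float(np.dot(s, fingerprint)) for name, fingerprint in library.items()}
        name = max(scores, key=scores.get)
        return name, scores[name]

    # library = {"glass_break": spectrum(glass_clip), "impact": spectrum(impact_clip)}
    # cue, similarity = best_match(live_window, library)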

[0083] In an embodiment, the comparison determines a confidence in the comparison, i.e., whether a detected sound matches a sound cue. Embodiments can utilize a variety of threshold functions (e.g., a single threshold value, above a threshold for a period of time, or some other function of time, confidence, and other signals in the environment) to determine if a detected sound matches a sound cue and should be acted upon. In response to finding a positive match of a recorded sound to a sound recorded in a library and based upon context of the robot, the robot’s motion can be modified, e.g., slowed or halted.
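
Two of the threshold functions mentioned above might be sketched as follows; the particular threshold values and hold time are arbitrary illustrations.

    # Illustrative threshold strategies for deciding whether a match is acted upon.
    from typing import List, Tuple

    def single_threshold(confidence: float, threshold: float = 0.8) -> bool:
        """Act as soon as a single confidence sample exceeds the threshold."""
        return confidence >= threshold

    def sustained_threshold(samples: List[Tuple[float, float]],
                            threshold: float = 0.6,
                            hold_seconds: float = 0.5) -> bool:
        """Act only if confidence stays above the threshold for hold_seconds.

        samples: (timestamp_seconds, confidence) pairs, newest last.
        """
        if not samples:
            return False
        now = samples[-1][0]
        recent = [c for t, c in samples if now - t <= hold_seconds]
        covers_window = now - samples[0][0] >= hold_seconds
        return covers_window and all(c >= threshold for c in recent)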

[0084] Embodiments can modify motion for a mobile or stationary robot. Embodiments can perform sound recognition, i.e., determining if a detected sound matches a sound cue using (i) a library of sound cues, (ii) a model of sound (i.e., frequency) cues, (iii) a trained neural network, (iv) a Bayesian estimator, (v) a convolutional neural network, or (vi) a system using a recurrent neural network architecture.

[0085] In an embodiment, sound capturing devices, such as a microphone or array of microphones, can be mounted to the robot itself. In another embodiment, sound capturing devices can be mounted to locations in an environment in which the robot operates. In an embodiment, locations of the sound capturing devices can be known to a system processing the sound to further enable noise cancellation and triangulation of a sound source. If mounted to the robot, the system can calculate the location of the sound capturing device(s) as they move with the robot. This allows an embodiment to perform calculations, e.g., sound triangulation, that are based on the dynamic location at the time the sound is recorded.
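
The sketch below illustrates one way the dynamic microphone locations could feed a triangulation calculation: microphone positions defined in the robot frame are transformed into the world frame using the robot's current pose, and the arrival offset between two microphones is estimated by cross-correlation. The pose transform, constants, and names are assumptions; a complete localizer would combine several such delay constraints.

    # Illustrative sketch: world-frame microphone positions from the robot pose,
    # and an inter-microphone arrival offset estimated by cross-correlation.
    import numpy as np

    SPEED_OF_SOUND = 343.0  # meters per second, at room temperature

    def mic_positions_world(mics_robot_frame: np.ndarray, robot_pose: np.ndarray) -> np.ndarray:
        """Transform N x 3 microphone positions (robot frame) into the world frame.
        robot_pose is assumed to be a 4 x 4 homogeneous transform from kinematics."""
        homogeneous = np.hstack([mics_robot_frame, np.ones((len(mics_robot_frame), 1))])
        return (robot_pose @ homogeneous.T).T[:, :3]

    def arrival_offset(signal_a: np.ndarray, signal_b: np.ndarray, sample_rate: int) -> float:
        """Signed arrival-time offset (seconds) between two microphone signals,
        estimated from the peak of their cross-correlation; the sign convention
        depends on the ordering of the arguments."""
        correlation = np.correlate(signal_a, signal_b, mode="full")
        lag = int(np.argmax(correlation)) - (len(signal_b) - 1)
        return lag / sample_rate

    # Each measured offset constrains the source to lie on a hyperboloid between a
    # pair of world-frame microphone positions; intersecting the constraints from
    # three or more microphones localizes (triangulates) the sound source.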

[0086] An embodiment continually monitors the sound capturing device input and determines if any sounds correspond to sounds which trigger an action, e.g., an emergency halt of the robot. In such an embodiment, a command, such as 'emergency stop' or 'zero torque', can be issued to the robot.

[0087] Embodiments provide numerous benefits over existing methods for robot control. Existing solutions rely on an emergency whistle, voice commands, e.g., a shouted 'STOP' command, or other non-verbal cues, such as excessive force, torque, or other physical signals. Other existing methods rely on position-based signals such as light curtains, pressure sensors, or motion sensors, or on physical switches such as an emergency stop button. Existing systems also use verbal cues to shut down alarms; for example, the NEST smoke detector looks for waving arms and verbal cues to detect false alarms.

[0088] Currently, implementations executing emergency stops in robotics rely on a physical interface device, such as a button or switch, which can be either wired or wireless. The drawback of this approach is that the human operator must remain in close physical proximity to the emergency stop device to activate the emergency stop feature. Other existing methods involve emergency stops based on sound, but are limited in scope (i.e., the specific sound of a whistle), or require specific hardware carried by the operator (i.e., using a headset).

[0089] In contrast, embodiments provide functionality to modify robot motion based on both verbal and nonverbal cues in the same system, with no hardware required for the user. The novel methods and systems described herein allow the robot to autonomously modify its motion based on non-verbal sound cues (e.g., the sound of glass breaking) without the need for a human operator to signal the modification.

[0090] Further, existing systems do not consider the variety of sounds which can occur in a robot's environment that are indicative of a severe problem or harmful situation for a human operator. For instance, the human operator might be accidentally injured by the robot and unable to press the emergency stop button or issue a verbal 'stop' command. However, using embodiments, the impact of a collision, for example, can be identified and processed automatically so as to modify a robot's motion and prevent further injury.

[0091] In an embodiment, the robot does not react if it is not already moving. In other words, the robot uses context about its environment. In embodiments, certain commands may cause the robot to slow down instead of coming to a complete stop. An embodiment can recognize the person speaking using speaker recognition so as to prevent unauthorized users from shouting commands. Embodiments can also triangulate the sound to determine the source of a sound based on an array of sound capturing devices, and provide a lower weight to sounds from a particular area, e.g., a customer area. While methods of triangulating a source of sound are known by a person of ordinary skill in the art, these methods focus on microphones having fixed locations. In embodiments where the sound capturing devices are mounted to the robot, the sound capturing devices move as the robot arm moves and, in such an embodiment, the triangulation calculation is changed dynamically by tracking the location of the sound capturing devices.

[0092] An embodiment provides a context-driven, sound or data-based emergency stop and motion reduction method to limit robot motion without a direct physical interface such as a button or switch. An audio signal, received/recorded via a microphone or an array of microphones, is compared to a library of sounds (e.g., verbal cues, such as "stop" or "ouch," and non-verbal cues, such as the sound of glass breaking). The comparison can be done using a voice or acoustic model. Similarly, other data inputs, such as a visual camera feed, depth information, and torque measurements, can be compared to a similar library of corresponding data. Similarly, a combination of this data, or a pattern of this data, can trigger a positive match for predetermined conditions. In response to such a match, a command can be issued to the robot to execute a set of predefined rules, such as reducing its speed or completely halting its motion.

[0093] An embodiment employs a mobile or stationary robot, a microphone or array of microphones, context sensors (such as camera, depth, or torque sensors), and a library of data points that, if detected by the sensor(s) (sound and context sensors), initiates a set of rules to be executed by the robot. Embodiments can also implement a system using a recurrent neural network architecture for data and pattern recognition. In an embodiment, the sound capturing device or array of sound capturing devices, or the context sensor or array of context sensors, can be mounted to the robot itself. In another embodiment, the array of sound capturing devices and context sensors can be mounted to locations in the environment in which the robot operates. In an embodiment, locations of the array of sound capturing devices and context sensors can be known to a system processing the data and sound to further enable noise cancellation and triangulation, i.e., locating, of data sources. If mounted to the robot, the system can calculate the location of the sound capturing devices and context sensors as they move with the robot. In an embodiment, a recurrent neural network can be used to perform speech recognition (e.g., converting audio to written text or another form) for processing.

[0094] As noted herein, existing implementations for controlling a robot are implemented via a physical interface device, such as a button or switch, by comparing measurements against thresholds, such as torque, voltage, or current limits, or by detecting boundary crossings such as intrusions into predefined zones. The drawback of these approaches is that either a human operator must remain in close physical proximity to the emergency stop device to activate the emergency stop feature, the thresholds are too conservative in nature and produce too many false positives, or certain events, such as glass breaking, are missed and not captured.

[0095] Existing methods that involve emergency stop based on sound are limited in scope (i.e., the specific sound of a whistle), or require specific hardware carried by the operator (i.e., using a headset). Other existing methods are limited to human screams or are limited in functionality and cannot be used for real-time emergency alerts.

[0096] In contrast, embodiments enable modifying robot motion based on verbal and nonverbal cues in the same system, together with context of the environment inferred from a collection of data sources. Unlike existing methods, embodiments require no particular hardware for users. Embodiments provide a novel approach that allows the robot to autonomously stop or modify its speed or motion based on non-verbal sound cues (e.g., the sound of glass breaking) and context data without requiring a human operator to signal the change.

[0097] Besides using sound as a trigger, embodiments can also use context of a robot’s current task and motion plan, and the state of the surroundings as measured by other sensors to inform the modifications to the robot’s movement. For instance, if no obstacles or humans are detected in the environment, then the confidence that a collision occurred is reduced. Conversely, if a human is present, and in close proximity to the robot, then it is highly likely that a collision occurred and the threshold for halting the robot is significantly reduced.
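
A toy example of such context-dependent thresholding is shown below; the distances and scaling factors are invented for illustration.

    # Illustrative only: lower the halt threshold when a human is detected close
    # to the robot, raise it slightly when no human is present.
    def halt_threshold(base_threshold: float,
                       human_detected: bool,
                       human_distance_m: float = float("inf")) -> float:
        if not human_detected:
            return min(1.0, base_threshold + 0.15)   # collision less plausible
        if human_distance_m < 1.0:
            return base_threshold * 0.5              # human within reach: act sooner
        return base_threshold

    # should_halt = collision_confidence >= halt_threshold(0.8, human_detected=True,
    #                                                      human_distance_m=0.6)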

[0098] Additionally, in an embodiment, if the robot is not moving, it should not react, as reacting might cause additional harm. For example, a human could have accidentally impacted a stationary robot, and the robot should not move as a result of that collision. For robots where a sudden stop might have catastrophic consequences, the reaction of the robot to the emergency signal can vary based on context.

[0099] Embodiments may use a plurality of sound capturing devices. For instance, using more than one microphone, e.g., four microphones, allows the sound origin to be determined, so that more weight can be given to commands which originate within reach of the robot.
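
For example, a weighting function along these lines could discount commands whose estimated origin lies outside the robot's reach; the reach radius and weights here are illustrative assumptions.

    # Hypothetical weighting of a command by its estimated origin.
    import numpy as np

    def command_weight(source_xyz: np.ndarray, robot_xyz: np.ndarray,
                       reach_m: float = 1.5) -> float:
        """Full weight within the robot's reach, reduced weight farther away
        (e.g., a customer area)."""
        distance = float(np.linalg.norm(source_xyz - robot_xyz))
        return 1.0 if distance <= reach_m else 0.4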

[00100] FIG. 6 illustrates a computer network or similar digital processing environment in which embodiments of the present disclosure may be implemented. Client computer(s)/devices 50 and server computer(s) 60 provide processing, storage, and input/output devices executing application programs and the like. The client computer(s)/devices 50 can also be linked through communications network 70 to other computing devices, including other client devices/processes 50 and server computer(s) 60. The communications network 70 can be part of a remote access network, a global network (e.g., the Internet), a worldwide collection of computers, local area or wide area networks, and gateways that currently use respective protocols (TCP/IP, Bluetooth®, etc.) to communicate with one another. Other electronic device/computer network architectures are suitable.

[00101] FIG. 7 is a diagram of an example internal structure of a computer (e.g., client processor/device 50 or server computers 60) in the computer system of FIG. 6. Each computer 50, 60 contains a system bus 79, where a bus is a set of hardware lines used for data transfer among the components of a computer or processing system. The system bus 79 is essentially a shared conduit that connects different elements of a computer system (e.g., processor, disk storage, memory, input/output ports, network ports, etc.) that enables the transfer of information between the elements. Attached to the system bus 79 is an I/O device interface 82 for connecting various input and output devices (e.g., keyboard, mouse, displays, printers, speakers, etc.) to the computer 50, 60. A network interface 86 allows the computer to connect to various other devices attached to a network (e.g., network 70 of FIG. 6). Memory 90 provides volatile storage for computer software instructions 92 and data 94 used to implement an embodiment of the present disclosure (e.g., structure generation module, computation module, and combination module code detailed above). Disk storage 95 provides non-volatile storage for computer software instructions 92 and data 94 used to implement an embodiment of the present disclosure. A central processor unit 84 is also attached to the system bus 79 and provides for the execution of computer instructions.

[00102] In one embodiment, the processor routines 92 and data 94 are a computer program product (generally referenced 92), including a non-transitory computer-readable medium (e.g., a removable storage medium such as one or more DVD-ROMs, CD-ROMs, diskettes, tapes, etc.) that provides at least a portion of the software instructions for the embodiment. The computer program product 92 can be installed by any suitable software installation procedure, as is well known in the art. In another embodiment, at least a portion of the software instructions may also be downloaded over a cable communication and/or wireless connection. In other embodiments, the invention programs are a computer program propagated signal product embodied on a propagated signal on a propagation medium (e.g., a radio wave, an infrared wave, a laser wave, a sound wave, or an electrical wave propagated over a global network such as the Internet, or other network(s)). Such carrier medium or signals may be employed to provide at least a portion of the software instructions for the present invention routines/program 92.

[00103] The teachings of all patents, published applications and references cited herein are incorporated by reference in their entirety.

[00104] While example embodiments have been particularly shown and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the scope of the embodiments encompassed by the appended claims.