Title:
DETECTION OF SURGICAL STATES, MOTION PROFILES, AND INSTRUMENTS
Document Type and Number:
WIPO Patent Application WO/2022/263870
Kind Code:
A1
Abstract:
An aspect includes a computer-implemented method that accesses input data including spatial data and/or sensor data temporally associated with a video stream of a surgical procedure. One or more machine-learning models predict a state of the surgical procedure based on the input data. The one or more machine-learning models detect one or more surgical instruments at least partially depicted in the video stream based on the input data. A state indicator and one or more surgical instrument indicators temporally correlated with the video stream are output. A first surgical instrument of the one or more surgical instruments is identified in the video stream, and a motion profile of the first surgical instrument is determined.

Inventors:
GIATAGANAS PETROS (GB)
ROBU MARIA RUXANDRA (GB)
GRAMMATIKOPOULOU MARIA (GB)
LUENGO MUNTION IMANOL (GB)
STOYANOV DANAIL V (GB)
CHOW ANDRE (GB)
Application Number:
PCT/GR2022/000017
Publication Date:
December 22, 2022
Filing Date:
March 18, 2022
Assignee:
DIGITAL SURGERY LTD (GB)
GIATAGANAS PETROS (GB)
International Classes:
G06K9/62; A61B90/90; G06N3/04; G06V10/82; G06V20/40; G06V40/20
Foreign References:
US 2020/0367974 A1 (2020-11-26)
Other References:
YIDAN QIN ET AL: "daVinciNet: Joint Prediction of Motion and Surgical State in Robot-Assisted Surgery", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 24 September 2020 (2020-09-24), XP081771349
QIN YIDAN ET AL: "Temporal Segmentation of Surgical Sub-tasks through Deep Learning with Multiple Data Sources", 2020 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA), IEEE, 31 May 2020 (2020-05-31), pages 371 - 377, XP033825895, DOI: 10.1109/ICRA40945.2020.9196560
WARD THOMAS M ET AL: "Computer vision in surgery", SURGERY, MOSBY, INC, US, vol. 169, no. 5, 1 December 2020 (2020-12-01), pages 1253 - 1256, XP086565444, ISSN: 0039-6060, [retrieved on 20201201], DOI: 10.1016/J.SURG.2020.10.039
Attorney, Agent or Firm:
YAZITZOGLOU, Evagelia (GR)
Claims:
CLAIMS

What is claimed is:

1. A computer-implemented method comprising: accessing input data comprising video data, spatial data, and/or sensor data temporally associated with a video stream of a surgical procedure; predicting, by one or more machine-learning models, a state of the surgical procedure based on the input data; detecting, by the one or more machine-learning models, one or more surgical instruments at least partially depicted in the video stream based on the input data, wherein the one or more machine-learning models comprise a plurality of feature encoders and task-specific decoders trained as an ensemble to detect the state and the one or more surgical instruments by sharing extracted features associated with the state and the one or more surgical instruments between the feature encoders and task-specific decoders; determining a state indicator and one or more surgical instrument indicators temporally correlated with the video stream; identifying a first surgical instrument of the one or more surgical instruments in the video stream; and determining, based on a position of the first surgical instrument, a motion profile of the first surgical instrument.

2. The computer-implemented method of claim 1, wherein the motion profile of the first surgical instrument is used to provide user feedback.

3. The computer-implemented method of claim 1, further comprising: displaying a first motion profile of the first surgical instrument using a first visual attribute; and displaying a second motion profile of a second surgical instrument using a second visual attribute.

4. The computer-implemented method of claim 1, wherein the video stream is captured by an endoscopic camera from inside of a patient body.

5. The computer-implemented method of claim 1, wherein the video stream of the surgical procedure is captured by a camera from outside of a patient body.

6. The computer-implemented method of claim 1, wherein the first surgical instrument is identified using at least one machine learning model of the one or more machine-learning models.

7. The computer-implemented method of claim 1, further comprising: identifying an anatomical structure in the video stream; and displaying the anatomical structure with a graphical overlay.

8. The computer-implemented method of claim 1, wherein the motion profile is displayed as an overlay on the video stream.

9. A system comprising: a machine-learning training system configured to use a training dataset to train one or more machine-learning models to detect a state of a surgical procedure based on the training dataset and detect whether one or more surgical instruments are at least partially depicted in the training dataset; a data collection system configured to capture a video stream of the surgical procedure in combination with one or more inputs temporally associated with multiple frames of the video stream as input data; a model execution system configured to execute the one or more machine-learning models to predict the state of the surgical procedure based on the input data and detect whether the one or more surgical instruments are at least partially depicted in the video stream based on the input data; and an output generator configured to output a state indicator, output one or more surgical instrument indicators temporally correlated with the video stream, and display one or more motion profiles of the one or more surgical instruments identified in the video stream.

10. The system of claim 9, wherein the one or more motion profiles are generated based on detecting the one or more surgical instruments and one or more anatomical structures at least partially depicted in the video stream.

11. The system of claim 9, wherein the output generator is configured to provide user feedback based on detecting that at least one of the one or more surgical instruments has veered off of a predetermined path by more than a predetermined threshold.

12. The system of claim 9, wherein the output generator is configured to provide user feedback based on detecting that at least one of the one or more surgical instruments is following a predetermined path within a predetermined threshold.

13. The system of claim 9, wherein the output generator is configured to output a plurality of distinct visual attributes to distinguish between displaying a plurality of motion profiles, and each of the motion profiles is associated with a different surgical instrument.

14. A computer program product comprising a memory device having computer executable instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method comprising: predicting, by one or more machine-learning models, a state of a surgical procedure based on input data from a video stream of the surgical procedure in combination with one or more inputs temporally associated with the video stream; detecting, by the one or more machine-learning models, one or more surgical instruments at least partially depicted in the video stream based on the input data, wherein the one or more machine-learning models are trained as an ensemble to detect the state and the one or more surgical instruments by sharing extracted features associated with the state and the one or more surgical instruments; generating a motion profile of at least one of the one or more surgical instruments detected in the video stream; and outputting a state indicator, the motion profile, and one or more surgical instrument indicators temporally correlated with the video stream.

15. The computer program product of claim 14, wherein the motion profile is displayed in synchronization with a playback of a surgical video comprising the video stream.

16. The computer program product of claim 14, wherein the motion profile is recorded in a three-dimensional graph that illustrates motion of the at least one of the one or more surgical instruments over a period of time.

17. The computer program product of claim 16, wherein the three-dimensional graph comprises at least one anatomical structure in combination with the motion profile.

18. The computer program product of claim 17, wherein the three-dimensional graph comprises a plurality of time slices that are displayed and selectable to change a point in time of video playback, and the motion profile extends through two or more of the time slices.

19. The computer program product of claim 14, wherein generating the motion profile comprises applying machine learning with multi-resolution segmentation.

20. The computer program product of claim 19, wherein the multi-resolution segmentation produces a segmentation map of a plurality of regions of estimated features including at least a portion of the one or more surgical instruments.

Description:
DETECTION OF SURGICAL STATES, MOTION PROFILES, AND INSTRUMENTS

CROSS REFERENCE TO RELATED APPLICATIONS

[0001] This application is a PCT application which claims the benefit of U.S. Provisional Patent Application No. 63/211,100, filed June 16, 2021, and U.S. Provisional Patent Application No. 63/211,139, filed June 16, 2021, both of which are incorporated by reference in their entirety herein.

BACKGROUND

[0002] The present invention relates in general to computing technology and relates more particularly to computing technology for automatic detection of surgical states, motion profiles, and instruments using machine-learning prediction.

[0003] Computer-assisted systems can be useful to augment a person’s physical sensing, perception, and reaction capabilities. For example, such systems can effectively provide information corresponding to an expanded field of vision, both temporal and spatial, that enables a person to adjust current and future actions based on the part of an environment not included in his or her physical field of view. However, providing such information relies upon an ability to process part of this extended field in a useful manner. Highly variable, dynamic, and/or unpredictable environments present challenges in defining rules that indicate how representations of the environments are to be processed to output data to productively assist the person in action performance. Further, identifying and tracking multiple objects in complex scenes can be challenging where variations in lighting, obstructions, and orientation of the objects may occur.

[0004] Computer-assisted surgery (CAS) includes use of computer technology for surgical planning, and for guiding or performing surgical interventions. In some aspects CAS can include or lead to robotic surgery. Robotic surgery can include a surgical instrument that performs one or more actions in relation to an action performed by medical personnel, such as a surgeon, an assistant, a nurse, etc. Alternatively, or in addition, the surgical instrument can be part of a supervisory-controlled system that executes one or more actions in a pre-programmed, or pre-trained manner. Alternatively, or in addition, the medical personnel can manipulate the surgical instrument in real-time. In yet other examples, the medical personnel carry out one or more actions via a platform that provides controlled manipulations of the surgical instrument based on the medical personnel’s actions.

SUMMARY

[0005] According to one or more aspects, a computer-implemented method accesses input data including spatial data and/or sensor data temporally associated with a video stream of a surgical procedure. One or more machine-learning models predict a state of the surgical procedure based on the input data. The one or more machine-learning models detect one or more surgical instruments at least partially depicted in the video stream based on the input data. A state indicator and one or more surgical instrument indicators temporally correlated with the video stream are determined. A first surgical instrument of the one or more surgical instruments is identified in the video stream. A motion profile of the first surgical instrument is determined based on a position of the first surgical instrument.

[0006] In one or more examples, the motion profile of the first surgical instrument can be used to provide user feedback.

[0007] In one or more examples, the computer-implemented method can include displaying a first motion profile of the first surgical instrument using a first visual attribute and displaying a second motion profile of a second surgical instrument using a second visual attribute.

[0008] In one or more examples, the video stream can be captured by an endoscopic camera from inside of a patient body.

[0009] In one or more examples, the video stream of the surgical procedure can be captured by a camera from outside of a patient body.

[0010] In one or more examples, the first surgical instrument can be identified using at least one machine learning model of the one or more machine-learning models.

[0011] In one or more examples, the computer-implemented method can include identifying an anatomical structure in the video stream and displaying the anatomical structure with a graphical overlay.

[0012] In one or more examples, the motion profile can be displayed as an overlay on the video stream.

[0013] According to one or more examples, a system includes a machine-learning training system configured to use a training dataset to train one or more machine-learning models to detect a state of a surgical procedure based on the training dataset and detect whether one or more surgical instruments are at least partially depicted in the training dataset. The system also includes a data collection system configured to capture a video stream of the surgical procedure in combination with one or more inputs temporally associated with multiple frames of the video stream as input data. The system further includes a model execution system configured to execute the one or more machine-learning models to predict the state of the surgical procedure based on the input data and detect whether the one or more surgical instruments are at least partially depicted in the video stream based on the input data. The system also includes an output generator configured to output a state indicator, output one or more surgical instrument indicators temporally correlated with the video stream, and display one or more motion profiles of the one or more surgical instruments identified in the video stream.

[0014] In one or more examples, the one or more motion profiles are generated based on detecting the one or more surgical instruments and one or more anatomical structures at least partially depicted in the video stream.

[0015] In one or more examples, the output generator is configured to provide user feedback based on detecting that at least one of the one or more surgical instruments has veered off of a predetermined path by more than a predetermined threshold.

[0016] In one or more examples, the output generator is configured to provide user feedback based on detecting that at least one of the one or more surgical instruments is following a predetermined path within a predetermined threshold.

[0017] In one or more examples, the output generator is configured to output a plurality of distinct visual attributes to distinguish between displaying a plurality of motion profiles, where each of the motion profiles is associated with a different surgical instrument.

[0018] According to one or more examples, a computer program product includes a memory device having computer executable instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method. The method can include predicting, by one or more machine-learning models, a state of a surgical procedure based on input data from a video stream of the surgical procedure in combination with one or more inputs temporally associated with the video stream. The method also includes detecting, by the one or more machine-learning models, one or more surgical instruments at least partially depicted in the video stream based on the input data, where the one or more machine-learning models are trained as an ensemble to detect the state and the one or more surgical instruments by sharing extracted features associated with the state and the one or more surgical instruments. The method includes generating a motion profile of at least one of the one or more surgical instruments detected in the video stream. The method further includes outputting a state indicator, the motion profile, and one or more surgical instrument indicators temporally correlated with the video stream.

[0019] In one or more examples, the motion profile can be displayed in synchronization with a playback of a surgical video including the video stream.

[0020] In one or more examples, the motion profile is recorded in a three-dimensional graph that illustrates motion of the at least one of the one or more surgical instruments over a period of time.

[0021] In one or more examples, the three-dimensional graph can include at least one anatomical structure in combination with the motion profile.

[0022] In one or more examples, the three-dimensional graph can include a plurality of time slices that are displayed and selectable to change a point in time of video playback, and the motion profile extends through two or more of the time slices.

[0023] In one or more examples, generating the motion profile can include applying machine learning with multi-resolution segmentation.

[0024] In one or more examples, the multi-resolution segmentation can produce a segmentation map of a plurality of regions of estimated features including at least a portion of the one or more surgical instruments.

[0025] Additional technical features and benefits are realized through the techniques of the present invention. Aspects of the invention are described in detail herein and are considered a part of the claimed subject matter. For a better understanding, refer to the detailed description and to the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

[0026] The specifics of the exclusive rights described herein are particularly pointed out and distinctly claimed in the claims at the conclusion of the specification. The foregoing and other features and advantages of the aspects of the invention are apparent from the following detailed description taken in conjunction with the accompanying drawings in which:

[0027] FIG. 1 shows a system for detection of surgical states and instruments in surgical data using machine learning according to one or more aspects;

[0028] FIG. 2 depicts a flowchart of a method for detection of surgical states and instruments in surgical data using machine learning according to one or more aspects;

[0029] FIG. 3 depicts a visualization of surgical data used for training one or more machine-learning models according to one or more aspects;

[0030] FIG. 4 depicts a flow diagram for joint multi-task training of machine-learning models used to detect surgical states and instruments in the surgical data according to one or more aspects;

[0031] FIG. 5 depicts an example of detecting and tracking surgical instruments according to one or more aspects;

[0032] FIG. 6 depicts a flow diagram of further training of machine-learning models by applying a previously learned instrument network according to one or more aspects;

[0033] FIG. 7 depicts a flow diagram of automatic prediction of surgical states and instruments in surgical data using one or more machine-learning models according to one or more aspects;

[0034] FIG. 8 depicts a computer system in accordance with one or more aspects;

[0035] FIG. 9 depicts a surgical procedure system in accordance with one or more aspects;

[0036] FIG. 10 depicts a computer-assisted surgery system in accordance with one or more aspects;

[0037] FIG. 11 depicts multiple motion profiles of surgical instruments displayed with respect to a surgical video in accordance with one or more aspects;

[0038] FIG. 12 depicts a process of determining and displaying motion profiles of surgical instruments in accordance with one or more aspects; and

[0039] FIG. 13 depicts an example of a machine learning model in accordance with one or more aspects.

[0040] The diagrams depicted herein are illustrative. There can be many variations to the diagrams or the operations described therein without departing from the spirit of the invention. For instance, the actions can be performed in a differing order, or actions can be added, deleted, or modified. Also, the term “coupled” and variations thereof describe having a communications path between two elements and do not imply a direct connection between the elements with no intervening elements/connections between them. All of these variations are considered a part of the specification.

DETAILED DESCRIPTION

[0041] Exemplary aspects of technical solutions described herein relate to, among other things, devices, systems, methods, computer-readable media, techniques, and methodologies for using machine learning and computer vision to automatically predict, or detect, surgical states, motion profiles, and instruments in surgical data. More generally, aspects can include detection, tracking, and predictions associated with one or more structures, the structures being deemed to be critical for an actor involved in performing one or more actions during a surgical procedure (e.g., by a surgeon). In one or more aspects, the structures are predicted dynamically and substantially in real-time as the surgical data is being captured and analyzed by technical solutions described herein.

A predicted structure can be an anatomical structure, a surgical instrument, etc. Motion profiles can track the position and movement of one or more surgical instruments relative to anatomical structures that are at least partially depicted in a video stream. The motion profiles can be output as one or more graphical overlays during playback. The motion profiles can also be used to detect deviations or alignment with predicted motion paths.
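For illustration, the following is a minimal Python sketch of how a motion profile could be accumulated from tracked instrument positions and compared against a predetermined path, in the spirit of the deviation/alignment feedback described above. The names (MotionProfile, deviation_from_path) and the distance-to-nearest-vertex rule are assumptions for the example, not details taken from the application.

```python
# Illustrative sketch only: MotionProfile and deviation_from_path are hypothetical
# names, and the deviation rule is an assumption, not the application's method.
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np


@dataclass
class MotionProfile:
    """Time-stamped 2D positions of one tracked surgical instrument."""
    instrument_id: str
    samples: List[Tuple[float, float, float]] = field(default_factory=list)  # (t, x, y)

    def add_sample(self, t: float, x: float, y: float) -> None:
        self.samples.append((t, x, y))

    def positions(self) -> np.ndarray:
        return np.array([(x, y) for _, x, y in self.samples], dtype=float)


def deviation_from_path(profile: MotionProfile, path: np.ndarray) -> np.ndarray:
    """Distance of each observed position to the nearest vertex of a predetermined path.

    `path` is an (N, 2) polyline approximating the expected instrument trajectory.
    """
    pos = profile.positions()                      # (M, 2)
    diffs = pos[:, None, :] - path[None, :, :]     # (M, N, 2)
    dists = np.linalg.norm(diffs, axis=-1)         # (M, N)
    return dists.min(axis=1)                       # per-sample deviation


# Example: warn when the instrument veers off the planned path by more than a threshold.
planned = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.5]])
profile = MotionProfile("grasper-1")
for t, x, y in [(0.0, 0.05, 0.02), (0.5, 1.1, 0.4), (1.0, 2.0, 1.4)]:
    profile.add_sample(t, x, y)

THRESHOLD = 0.5
if (deviation_from_path(profile, planned) > THRESHOLD).any():
    print("feedback: instrument has veered off the predetermined path")
```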

[0042] In some instances, a computer-assisted surgical (CAS) system is provided that uses one or more machine-learning models, trained with surgical data, to augment environmental data directly sensed by an actor involved in performing one or more actions during a surgical procedure (e.g., a surgeon). Such augmentation of perception and action can increase action precision, optimize ergonomics, improve action efficacy, enhance patient safety, and improve the standard of the surgical process. The output of the one or more machine-learning models can also be an alert used to trigger a real-time notification of an alternative route in the surgical procedure, a request for assistance, an on-screen visualization of a current surgical objective of the surgical procedure, a warning by a popup or notification of incorrect instrument usage in proximity of one or more anatomy landmarks, and/or a scheduling update associated with a later surgical procedure, for example.

[0043] The surgical data provided to train the machine-learning models can include data captured during a surgical procedure, as well as simulated data. The surgical data can include time-varying image data (e.g., a simulated/real video stream from different types of cameras) corresponding to a surgical environment. The surgical data can also include other types of data streams, such as audio, radio frequency identifier (RFID), text, robotic sensors, other signals, etc. The machine-learning models are trained to predict and identify, in the surgical data, “structures” including particular tools, anatomic objects, and actions being performed in the simulated/real surgical stages. In one or more aspects, the machine-learning models are trained to define one or more parameters of the models so as to learn how to transform new input data (that the models are not trained on) to identify one or more structures. During the training, the models receive, as input, one or more data streams that may be augmented with data indicating the structures in the data streams, such as indicated by metadata and/or image-segmentation data associated with the input data. The data used during training can also include temporal sequences of one or more input data.

[0044] In one or more aspects, the simulated data can be generated to include image data (e.g., which can include time-series image data or video data and can be generated in any wavelength of sensitivity) that is associated with variable perspectives, camera poses, lighting (e.g., intensity, hue, etc.) and/or motion of imaged objects (e.g., tools). In some instances, multiple data sets can be generated - each of which corresponds to the same imaged virtual scene but varies with respect to perspective, camera pose, lighting, and/or motion of imaged objects, or varies with respect to the modality used for sensing, e.g., red-green-blue (RGB) images or depth or temperature. In some instances, each of the multiple data sets corresponds to a different imaged virtual scene and further varies with respect to perspective, camera pose, lighting, and/or motion of imaged objects.

[0045] The machine-learning models can include, for instance, a fully convolutional network adaptation (FCN) and/or conditional generative adversarial network model configured with one or more hyperparameters for state and/or surgical instrument detection. For example, the machine-learning models (e.g., the fully convolutional network adaptation) can be configured to perform supervised, self-supervised or semi-supervised semantic segmentation in multiple classes - each of which corresponding to a particular surgical instrument, anatomical body part (e.g., generally or in a particular state), and/or environment. Alternatively, or in addition, the machine-learning model (e.g., the conditional generative adversarial network model) can be configured to perform unsupervised domain adaptation to translate simulated images to semantic instrument segmentations. As a further example, the machine-learning models can include one or more transformer-based networks. It is understood that other types of machine-learning models or combinations thereof can be used in one or more aspects. Machine-learning models can further be trained to perform surgical state detection and may be developed for a variety of surgical workflows, as further described herein. Machine-learning models can be collectively managed as a group, also referred to as an ensemble, where the machine-learning models are used together and may share feature spaces between elements of the models. As such, reference to a machine-learning model or machine-learning models herein may refer to a combination of multiple machine-learning models that are used together, such as operating on a same group of data. Although specific examples are described with respect to types of machine-learning models, other machine-learning and/or deep learning techniques can be used to implement the features described herein.

[0046] In one or more aspects, one or more machine-learning models are trained using a joint training process to find correlations between multiple tasks that can be observed and predicted based on a shared set of input data and/or intermediate computed features. Further machine-learning refinements can be achieved by using a portion of a previously trained machine-learning network to further label or refine a training dataset used in training the one or more machine-learning models. For example, semi-supervised or self-supervised learning can be used to initially train the one or more machine-learning models using partially annotated input data as a training dataset. A surgical workflow can include, for instance, surgical phases, steps, actions, and/or other such states/activities. Aspects further described herein with respect to surgical phase can be applied to other surgical workflow states/activities, such as surgical steps and/or actions. The partially annotated training dataset may be missing labels on some of the data associated with a particular input, such as missing labels on instrument data. An instrument network learned as part of the one or more machine-learning models can be applied to the partially annotated training dataset to add missing labels to partially labeled instrument data in the training dataset. The updated training dataset with at least a portion of the missing labels populated can be used to further train the one or more machine-learning models. This iterative training process may result in model size compression for faster performance and can improve overall accuracy by training ensembles. Ensemble performance improvement can result where feature sets are shared such that feature sets related to surgical instruments are also used for surgical state detection, for example. Thus, improving the performance aspects of machine learning related to instrument data may also improve the performance of other networks that are primarily directed to other tasks.

[0047] After training, the one or more machine-learning models can then be used in real-time to process one or more data streams (e.g., video streams, audio streams, RFID data, etc.). The processing can include predicting and characterizing one or more surgical states, instruments, and/or other structures within various instantaneous or block time periods. The results can then be used to identify the presence, localization, and/or use of one or more features. For example, the localization of surgical instrument(s) can include a bounding box, a medial axis, and/or any other marker or keypoint identifying the location of the surgical instrument(s). Various approaches to localization can be performed individually or jointly. The localization can be represented as coordinates in images that map to pixels depicting the surgical instrument(s) in the images. Localization of other structures, such as anatomical structures, can be used to provide locations, e.g., coordinates, heatmaps, bounding boxes, boundaries, masks, etc., of one or more anatomical structures identified and distinguish between other structures, such as surgical instruments. Anatomical structures can include organs, arteries, implants, surgical artifacts (e.g., staples, stitches, etc.), etc. A “location” of a detected feature, such as an anatomical structure, surgical instrument, etc., can be specified as multiple sets of coordinates (e.g., polygon), a single set of coordinates (e.g., centroid), or any other such manner without limiting the technical features described herein.
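As a concrete, hypothetical illustration of localization outputs such as a bounding box and a centroid derived from a segmentation result, consider the following sketch; the function name and the use of a binary mask are assumptions for the example, not the application's specific representation.

```python
# Illustrative sketch only; localize_from_mask is a hypothetical helper.
import numpy as np


def localize_from_mask(mask: np.ndarray):
    """Derive simple localization markers (bounding box and centroid) from a binary
    segmentation mask marking pixels that belong to one detected structure.

    Returns ((x_min, y_min, x_max, y_max), (x_centroid, y_centroid)) in pixel
    coordinates, or None if the structure is not present in the frame.
    """
    ys, xs = np.nonzero(mask)
    if xs.size == 0:
        return None
    bbox = (int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max()))
    centroid = (float(xs.mean()), float(ys.mean()))
    return bbox, centroid


# Example: a 6x6 frame where a small instrument tip occupies a 2x2 pixel region.
mask = np.zeros((6, 6), dtype=bool)
mask[2:4, 3:5] = True
print(localize_from_mask(mask))  # ((3, 2, 4, 3), (3.5, 2.5))
```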

[0048] Alternatively, or in addition, the structures can be used to identify a stage within a surgical workflow (e.g., as represented via a surgical data structure), predict a future stage within a workflow, the remaining time of the operation, etc. Workflows can be segmented into a hierarchy, such as events, actions, steps, surgical objectives, phases, complications, and deviations from a standard workflow. For example, an event can be camera in, camera out, bleeding, leak test, etc. Actions can include surgical activities being performed, such as incision, grasping, etc. Steps can include lower-level tasks as part of performing an action, such as first stapler firing, second stapler firing, etc.

Surgical objectives can define a desired outcome during surgery, such as gastric sleeve creation, gastric pouch creation, etc. Phases can define a process during a surgical procedure, such as preparation, surgery, closure, etc. Complications can define problems such as hemorrhaging, staple dislodging, etc. Deviations can include alternative routes indicative of any type of change from a previously learned workflow. A state of a surgical procedure or a surgical workflow can refer to any level of granularity, such as one or more of events, actions, steps, surgical objectives, phases. Aspects can include workflow detection and prediction as further described herein.

[0049] FIG. 1 shows a system 100 for predicting surgical states and instruments in surgical data using machine learning according to one or more aspects. System 100 uses data streams that are part of the surgical data to identify procedural states according to some aspects. System 100 includes a procedural control system 105 that collects image data and coordinates outputs responsive to predicted structures and states. The procedural control system 105 can include one or more devices (e.g., one or more user devices and/or servers) located within and/or associated with a surgical operating room and/or control center. System 100 further includes a machine-learning processing system 110 that processes the surgical data using one or more machine-learning models to identify a procedural state (also referred to as a state or a stage), which is used to identify a corresponding output. It will be appreciated that machine-learning processing system 110 can include one or more devices (e.g., one or more servers), each of which can be configured to include part or all of one or more of the depicted components of the machine-learning processing system 110. In some instances, a part, or all of machine-learning processing system 110 is in the cloud and/or remote from an operating room and/or physical location corresponding to a part, or all of procedural control system 105. For example, the machine-learning training system 125 can be a separate device (e.g., a server) that stores its output as the one or more trained machine-learning models 130, which are accessible by the model execution system 140, separate from the machine-learning training system 125. In other words, in some aspects, devices that “train” the models are separate from devices that “infer,” i.e., perform real-time processing of surgical data using the trained models 130.

[0050] Machine-learning processing system 110 includes a data generator 115 configured to generate simulated surgical data, such as a set of virtual images, or record surgical data from ongoing procedures, to train one or more machine-learning models. Data generator 115 can access (read/write) a data store 120 with recorded data, including multiple images and/or multiple videos. The images and/or videos can include images and/or videos collected during one or more procedures (e.g., one or more surgical procedures). For example, the images and/or video may have been collected by a user device worn by a participant (e.g., surgeon, surgical nurse, anesthesiologist, etc.) during the surgery, and/or by a non-wearable imaging device located within an operating room.

[0051] Each of the images and/or videos included in the recorded data can be defined as a base image and can be associated with other data that characterizes an associated procedure and/or rendering specifications. For example, the other data can identify a type of procedure, a location of a procedure, one or more people involved in performing the procedure, surgical objectives, and/or an outcome of the procedure. Alternatively, or in addition, the other data can indicate a stage of the procedure with which the image or video corresponds, rendering specification with which the image or video corresponds and/or a type of imaging device that captured the image or video (e.g., and/or, if the device is a wearable device, a role of a particular person wearing the device, etc.).

Further, the other data can include image-segmentation data that identifies and/or characterizes one or more objects (e.g., tools, anatomical objects, etc.) that are depicted in the image or video. The characterization can indicate the position, orientation, or pose of the object in the image. For example, the characterization can indicate a set of pixels that correspond to the object and/or a state of the object resulting from a past or current user handling. Localization can be performed using a variety of techniques for identifying objects in one or more coordinate systems.

[0052] Data generator 115 identifies one or more sets of rendering specifications for the set of virtual images. An identification is made as to which rendering specifications are to be specifically fixed and/or varied. Alternatively, or in addition, the rendering specifications that are to be fixed (or varied) are predefined. The identification can be made based on, for example, input from a client device, a distribution of one or more rendering specifications across the base images and/or videos, and/or a distribution of one or more rendering specifications across other image data. For example, if a particular specification is substantially constant across a sizable data set, the data generator 115 defines a fixed corresponding value for the specification. As another example, if rendering-specification values from at least a predetermined amount of data span across a range, the data generator 115 defines the rendering specifications based on the range (e.g., to span the range or to span another range that is mathematically related to the range of distribution of the values).

[0053] A set of rendering specifications can be defined to include discrete or continuous (finely quantized) values. A set of rendering specifications can be defined by a distribution, such that specific values are to be selected by sampling from the distribution using random or biased processes.

[0054] One or more sets of rendering specifications can be defined independently or in a relational manner. For example, if the data generator 115 identifies five values for a first rendering specification and four values for a second rendering specification, the one or more sets of rendering specifications can be defined to include twenty combinations of the rendering specifications or fewer (e.g., if one of the second rendering specifications is only to be used in combination with an incomplete subset of the first rendering specification values or the converse). In some instances, different rendering specifications can be identified for different procedural states and/or other metadata parameters (e.g., procedural types, procedural locations, etc.).
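The following sketch illustrates, under invented specification names and values, how sets of rendering specifications could be combined relationally so that incompatible pairs are excluded, yielding "twenty combinations or fewer" as discussed above; the constraint and the sampled distribution are assumptions for the example.

```python
# Illustrative sketch only; the specification names, values, and constraint are invented.
import itertools
import random

# Five values for a first rendering specification and four for a second, as in the
# "twenty combinations or fewer" discussion above.
lighting_intensity = [0.2, 0.4, 0.6, 0.8, 1.0]
camera_pose = ["anterior", "lateral", "oblique", "overhead"]


def compatible(intensity: float, pose: str) -> bool:
    # Hypothetical constraint: an overhead pose is only rendered at higher intensities.
    return not (pose == "overhead" and intensity < 0.6)


combinations = [
    {"lighting_intensity": i, "camera_pose": p}
    for i, p in itertools.product(lighting_intensity, camera_pose)
    if compatible(i, p)
]
print(len(combinations))  # fewer than the full 20 because of the constraint

# A specification can also be defined by a distribution and sampled per virtual image.
sampled_zoom = random.uniform(1.0, 2.5)
print(round(sampled_zoom, 2))
```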

[0055] Using the rendering specifications and base image data, the data generator 115 generates simulated surgical data (e.g., a set of virtual images), which is stored at the data store 120. For example, a three-dimensional model of an environment and/or one or more objects can be generated using the base image data. Virtual image data can be generated using the model to determine - given a set of particular rendering specifications (e.g., background lighting intensity, perspective, zoom, etc.) and other procedure-associated metadata (e.g., a type of procedure, a procedural state, a type of imaging device, etc.). The generation can include, for example, performing one or more transformations, translations, and/or zoom operations. The generation can further include adjusting the overall intensity of pixel values and/or transforming RGB values to achieve particular color-specific specifications.

[0056] A machine-learning training system 125 uses the recorded data in the data store 120, which can include the simulated surgical data (e.g., set of virtual images) and actual surgical data to train one or more machine-learning models. The machine-learning models can be defined based on a type of model and a set of hyperparameters (e.g., defined based on input from a client device). The machine-learning models can be configured based on a set of parameters that can be dynamically defined based on (e.g., continuous or repeated) training (i.e., learning, parameter tuning). Machine-learning training system 125 can use one or more optimization algorithms to define the set of parameters to minimize or maximize one or more loss functions. The set of (learned) parameters can be stored as a trained machine-learning model data structure 130, which can also include one or more non-learnable variables (e.g., hyperparameters and/or model definitions).

[0057] A model execution system 140 can access the machine-learning model data structure 130 and accordingly configure one or more machine-learning models for inference (i.e., prediction). The one or more machine-learning models can include, for example, a fully convolutional network adaptation, an adversarial network model, or other types of models as indicated in data structure 130. The one or more machine-learning models can be configured in accordance with one or more hyperparameters and the set of learned parameters.

[0058] The one or more machine-learning models, during execution, can receive, as input, surgical data to be processed and generate one or more inferences according to the training. For example, the surgical data can include data streams (e.g., an array of intensity, depth, and/or RGB values) for a single image or for each of a set of frames (e.g., including multiple images or an image with sequencing data) representing a temporal window of fixed or variable length in a video. The surgical data that is input can be received from a real-time data collection system 145, which can include one or more devices located within an operating room and/or streaming live imaging data collected during the performance of a procedure. Video processing can include decoding and/or decompression when a video stream is received in an encoded or compressed format such that data for a sequence of images can be extracted and processed. The surgical data can include additional data streams, such as audio data, RFID data, textual data, measurements from one or more surgical instruments/sensors, etc., that can represent stimuli/procedural state from the operating room. The different inputs from different devices/sensors are synchronized before inputting in the model.
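A minimal sketch of synchronizing additional data streams to video frames is shown below; the stream names and the "latest sample at or before the frame time" alignment rule are assumptions chosen for illustration, not the application's synchronization method.

```python
# Illustrative sketch only; stream names and the alignment rule are assumptions.
import bisect
from typing import Dict, List, Tuple


def latest_before(samples: List[Tuple[float, object]], t: float):
    """Return the most recent sample at or before time t (or None if none exists)."""
    times = [ts for ts, _ in samples]
    i = bisect.bisect_right(times, t)
    return samples[i - 1][1] if i > 0 else None


def synchronize(frame_times: List[float],
                streams: Dict[str, List[Tuple[float, object]]]) -> List[dict]:
    """Build one synchronized input record per video frame from additional data streams."""
    records = []
    for t in frame_times:
        record = {"frame_time": t}
        for name, samples in streams.items():
            record[name] = latest_before(samples, t)
        records.append(record)
    return records


# Example: sensor and RFID samples arriving at their own rates, aligned to 3 frames.
streams = {
    "heart_rate": [(0.00, 71), (0.95, 72), (1.90, 74)],
    "rfid":       [(0.50, "stapler-in"), (1.75, "stapler-out")],
}
print(synchronize([0.0, 1.0, 2.0], streams))
```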

[0059] The one or more machine-learning models can analyze the surgical data, and in one or more aspects, predict and/or characterize structures included in the visual data from the surgical data. The visual data can include image and/or video data in the surgical data. The prediction and/or characterization of the structures can include segmenting the visual data or predicting the localization of the structures with a probabilistic heatmap. In some instances, the one or more machine-learning models include or are associated with a preprocessing or augmentation (e.g., intensity normalization, resizing, cropping, etc.) that is performed prior to segmenting the visual data. An output of the one or more machine-learning models can include image-segmentation or probabilistic heatmap data that indicates which (if any) of a defined set of structures are predicted within the visual data, a location and/or position and/or pose of the structure(s) within the image data, and/or state of the structure(s). The location can be a set of coordinates in the image data. For example, the coordinates can provide a bounding box. The coordinates can provide boundaries that surround the structure(s) being predicted. The one or more machine-learning models can be trained to perform higher-level predictions and tracking, such as predicting a state of a surgical procedure and tracking one or more surgical instruments used in the surgical procedure, as further described herein.

[0060] A state detector 150 can use the output from the execution of the machine-learning model to identify a state within a surgical procedure (“procedure”). A procedural tracking data structure can identify a set of potential states that can correspond to part of a performance of a specific type of procedure. Different procedural data structures (e.g., and different machine-learning-model parameters and/or hyperparameters) may be associated with different types of procedures. The data structure can include a set of nodes, with each node corresponding to a potential state.

The data structure can include directional connections between nodes that indicate (via the direction) an expected order during which the states will be encountered throughout an iteration of the procedure. The data structure may include one or more branching nodes that feed to multiple next nodes and/or can include one or more points of divergence and/or convergence between the nodes. In some instances, a procedural state indicates a procedural action (e.g., surgical action) that is being performed or has been performed and/or indicates a combination of actions that have been performed. In some instances, a procedural state relates to a biological state of a patient undergoing a surgical procedure. For example, the biological state can indicate a complication (e.g., blood clots, clogged arteries/veins, etc.), pre-condition (e.g., lesions, polyps, etc.).

[0061] Each node within the data structure can identify one or more characteristics of the state. The characteristics can include visual characteristics. In some instances, the node identifies one or more tools that are typically in use or availed for use (e.g., on a tool tray) during the state, one or more roles of people who are typically performing a surgical task, a typical type of movement (e.g., of a hand or tool), etc. Thus, state detector 150 can use the segmented data generated by model execution system 140 that indicates the presence and/or characteristics of particular objects within a field of view to identify an estimated node to which the real image data corresponds. Identification of the node (and/or state) can further be based upon previously detected states for a given procedural iteration and/or other detected input (e.g., verbal audio data that includes person-to-person requests or comments, explicit identifications of a current or past state, information requests, etc.).
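As a hypothetical illustration of a procedural tracking data structure with nodes, directional connections, and per-node tool characteristics, consider the following sketch; the phase names, tool sets, and the simple tool-overlap scoring rule are invented for the example and do not come from the application.

```python
# Illustrative sketch only; phase names, tool lists, and scoring rule are invented.
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class StateNode:
    name: str
    typical_tools: Set[str]
    next_states: List[str] = field(default_factory=list)  # directional connections


procedure_graph: Dict[str, StateNode] = {
    "preparation": StateNode("preparation", {"trocar"}, ["dissection"]),
    "dissection":  StateNode("dissection", {"grasper", "hook"}, ["stapling", "dissection"]),
    "stapling":    StateNode("stapling", {"stapler"}, ["closure"]),
    "closure":     StateNode("closure", {"needle driver"}, []),
}


def estimate_state(previous: str, detected_tools: Set[str]) -> str:
    """Pick the allowed next state whose typical tools best match the detected tools."""
    candidates = procedure_graph[previous].next_states or [previous]
    return max(candidates,
               key=lambda name: len(procedure_graph[name].typical_tools & detected_tools))


print(estimate_state("dissection", {"stapler"}))  # -> "stapling"
```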

[0062] An output generator 160 can use the state to generate an output. Output generator 160 can include an alert generator 165 that generates and/or retrieves information associated with the state and/or potential next events. For example, the information can include details as to warnings and/or advice corresponding to current or anticipated procedural actions. The alert generator 165 can be configured to communicate with one or more other systems, such as procedural control system 105, to provide notice or trigger actions based on the information. The information can further include one or more events for which to monitor. The information can identify the next recommended action. The alert generator 165 output can deliver tips and alerts to a surgical team to improve team coordination based on the state of the procedure. Machine learning can be used to determine the remaining time of an operation to help with preparation and scheduling of the facilities for subsequent use. Alerts can be generated to warn or notify surgeons or other parties through various devices and systems.

[0063] The user feedback can be transmitted to an alert output system 170, which can cause the user feedback to be output via a user device and/or other devices that is (for example) located within the operating room or control center. The user feedback can include a visual, audio, tactile, or haptic output that is indicative of the information. The user feedback can facilitate alerting an operator, for example, a surgeon, or any other user of the system 100. The alert output system 170 may also provide alert information 185 to one or more other systems (not depicted).

[0064] Output generator 160 can also include an augmentor 175 that generates or retrieves one or more graphics and/or text to be visually presented on (e.g., overlaid on) or near (e.g., presented underneath or adjacent to or on separate screen) real-time capture of a procedure. Augmentor 175 can further identify where the graphics and/or text are to be presented (e.g., within a specified size of a display). In some instances, a defined part of a field of view is designated as being a display portion to include augmented data. In some instances, the position of the graphics and/or text is defined so as not to obscure the view of an important part of an environment for the surgery and/or to overlay particular graphics (e.g., of a tool) with the corresponding real-world representation.

[0065] Augmentor 175 can send the graphics and/or text and/or any positioning information to an augmented reality device 180, which can integrate the graphics and/or text with a user's environment in real-time as an augmented reality visualization. Augmented reality device 180 can include a pair of goggles that can be worn by a person participating in part of the procedure. It will be appreciated that, in some instances, the augmented display can be presented at a non-wearable user device, such as at a computer or tablet. The augmented reality device 180 can present the graphics and/or text at a position as identified by augmentor 175 and/or at a predefined position. Thus, a user can maintain a real-time view of procedural operations and further view pertinent state- related information.

[0066] FIG. 2 depicts a flowchart of a method for detection of surgical states and instruments in surgical data using machine learning according to one or more aspects.

The method 200 can be executed by the system 100 of FIG. 1 as a computer-implemented method.

[0067] The method 200 includes using (in an inference phase) one or more machinelearning models 702 of FIG. 7 to detect, predict, and track surgical states being performed in a procedure and structures, such as surgical instruments, used in the procedure. The one or more machine-learning models 702 are examples of the one or more trained machine-learning models 130 of FIG. 1.

[0068] At block 202, the system 100 can access input data including, for example, video data, spatial data, and/or sensor data temporally associated with a video stream of a surgical procedure. At block 204, the one or more machine-learning models 702 can predict a state of the surgical procedure based on the input data. At block 206, the one or more machine-learning models 702 can detect one or more surgical instruments at least partially depicted in the video stream based on the input data, when such features are present. Detection of surgical instruments can include determining a presence or localization of one or more surgical instruments. The localization can include, for example, a bounding box, a medial axis, and/or any other marker or keypoint identifying the location of one or more surgical instruments. Upon surgical instrument detection through presence and/or localization, tracking can be performed to observe and predict positioning of the surgical instruments with respect to other structures. As further described herein, the one or more machine-learning models 702 can include a plurality of feature encoders and task-specific decoders trained as an ensemble to detect the state and the one or more surgical instruments by sharing extracted features associated with the state and the one or more surgical instruments between the feature encoders and task-specific decoders. At block 208, a state indicator and/or one or more surgical instrument indicators temporally correlated with the video stream can be output.
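The flow of blocks 202-208 could be sketched as follows; the model interfaces (predict_state, detect_instruments) and the record layout are hypothetical stand-ins, not the trained networks described in the application.

```python
# Illustrative sketch of the flow in blocks 202-208; predict_state and
# detect_instruments are hypothetical stand-ins for the trained models.
from typing import Callable, Dict, List, Tuple


def run_inference(
    windows: List[dict],
    predict_state: Callable[[dict], str],
    detect_instruments: Callable[[dict], List[dict]],
) -> Tuple[List[dict], Dict[str, list]]:
    """Predict a state and detect instruments per input window, emitting indicators
    temporally correlated with the video stream and accumulating motion profiles."""
    outputs: List[dict] = []
    tracks: Dict[str, list] = {}                    # per-instrument motion profiles
    for window in windows:
        state = predict_state(window)               # block 204: state prediction
        detections = detect_instruments(window)     # block 206: instrument detection
        for det in detections:                      # track positions over time
            tracks.setdefault(det["label"], []).append((window["frame_time"], det["centroid"]))
        outputs.append({                            # block 208: temporally correlated indicators
            "frame_time": window["frame_time"],
            "state_indicator": state,
            "instrument_indicators": [d["label"] for d in detections],
        })
    return outputs, tracks


# Example with trivial stand-in models.
frames = [{"frame_time": 0.0}, {"frame_time": 1.0}]
outs, tracks = run_inference(
    frames,
    predict_state=lambda w: "dissection",
    detect_instruments=lambda w: [{"label": "grasper", "centroid": (0.4, 0.5)}],
)
print(outs[0]["state_indicator"], tracks["grasper"])
```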

[0069] At block 210, based on the state as predicted and tracked instrument information of the one or more surgical instruments, the system 100 can output a notification, a request, a visualization, and/or a scheduling update depending on how the state and tracked instrument information are used. For example, the notification can be a real-time notification of an alternative route in the surgical procedure, the request can be a request for assistance, the visualization can be an on-screen visualization of a current surgical objective of the surgical procedure, and the scheduling update can be a scheduling update associated with a later surgical procedure. The visualization can include displaying one or more motion profiles of surgical instruments that can include tracking observed motion and/or a predicted path of motion. When observed in real-time, motion profiles can guide usage of surgical instruments during a surgical procedure.

When used during post-operative analysis, motion profiles track whether a surgical instrument followed or deviated from a predetermined path. Other types of outputs are contemplated in one or more aspects. For example, an output can include a warning by a popup or notification of incorrect instrument usage in proximity of one or more anatomy landmarks.

[0070] The one or more machine-learning models 702 of FIG. 7 can operate on the surgical data per frame, but can use information from a previous frame, or a window of previous frames. FIG. 3 depicts a visualization of surgical data used for training the one or more machine-learning models 702 according to one or more aspects. The depicted example surgical data 300 includes video data, i.e., a sequence of N images 302. For training the one or more machine-learning models 702, images 302, and other inputs can be annotated; however, the annotations may be incomplete, resulting in unlabeled data 303 within an input window 320 of input data in some instances. The annotations can include temporal annotations 306 that identify a surgical state to which an image belongs or tracking information for different structures in temporal data 305. Accordingly, a particular set or subset of images 302 represents a surgical state or tracking state. The subset of images 302 can include one or more images and may be sequential.

[0071] Further, the annotations can include spatial annotations 308 of spatial data 307 that identify one or more objects in the images 302. For example, the spatial annotations 308 can specify one or more regions of an image and identify respective objects in the regions. Further, an image can be associated with sensor annotations 310 that include values of one or more sensor measurements from sensor data 309 at the time the image was captured. The sensor measurements can be from sensors associated with the patient, such as oxygen level, blood pressure, heart rate, etc. Alternatively, or in addition, the sensor measurements can be associated with one or more components being used in the surgical procedure, such as a brightness level of an endoscope, a fluid level in a tank, energy output from a generator, etc. Sensor measures can also come from real-time robotic systems (e.g., robotic kinematics) indicating surgical activations or position or pose information about instruments. Other types of annotations can be used to train the one or more machine-learning models 702 in other aspects.

[0072] The one or more machine-learning models 702 can take into consideration one or more temporal inputs, such as sensor information, acoustic information, along with spatial annotations associated with images 302 when detecting features in the surgical data 300. A set of such temporally synchronized inputs from the surgical data 300 that are analyzed together by the one or more machine-learning models 702 can be referred to as an “input window” 320 of a training dataset 301. However, the input window 320 can include any type of observable data, including video data, that can be temporally and/or spatially aligned from the surgical data 300 and/or other sources. The one or more machine-learning models 702, during inference, operate on the input window 320 to predict a surgical state represented by the images in the input window 320. Each image 302 in the input window 320 is associated with synchronized temporal and spatial annotations, such as measurements at a particular timepoint including sensor information, acoustic information, and/or other information.

[0073] The input window 320 of input data can span a plurality of frames of a video stream of images 302 in combination with the spatial data 307 and/or the sensor data 309 temporally associated with the frames of images 302. The input window 320 also correlates the frames of images 302 with temporal data 305. The input window 320 can slide with respect to time as the one or more machine-learning models 702 predict the state and track the one or more surgical instruments and/or as the input window 320 is used for training the one or more machine-learning models 702. In some instances, during training or in real-time use, the input window 320 slides on a frame-by-frame basis, such as starting with frame number 10 and advancing to frame number 11 as the starting position. Each iteration using the input window 320 may use available data sets of the starting position plus the next two frames before sliding the input window 320 to start at a different frame number. Thus, one iteration may act upon frame numbers 10, 11, and 12 as the input window 320, and the next iteration may act upon frame numbers 11, 12, and 13. Further, overlap during sliding of the input window 320 may be reduced, for instance, where the last frame number of one iteration becomes the starting frame number for the next iteration. Alternatively, there may be a gap 325 as one or more of the frames are skipped as the input window 320 slides to a subsequent input window 330 position. The gap 325 may result in no overlap of frame numbers between the input window 320 and the subsequent input window 330.

[0074] In one or more aspects, separate instances of the one or more machine-learning models 702 can be trained for respective types of procedures. For example, separate instances of the one or more machine-learning models 702 can be trained to predict states in knee arthroscopy, laparoscopic removal of a gallbladder, endoscopic mucosal resection, and other such surgical procedures. Because each procedure can have specific states (e.g., the sequence of operations) and specific attributes (e.g., anatomical features, instruments, etc.), the one or more machine-learning models 702 can be trained to predict and identify the states of the procedures. It is understood that the technical solutions described herein are not limited to a particular type of surgical procedure unless explicitly indicated. As such, "surgical procedure" or "procedure" can be any of one or more surgeries performed, and not limited to the above-listed examples.
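By way of a non-limiting illustration of the sliding behavior of the input window 320 described in paragraph [0073] above, the following Python sketch enumerates successive window positions over a sequence of frame numbers. The function name and the window size, stride, and gap values are illustrative assumptions, not features of the aspects described above.

    def sliding_windows(num_frames, window_size=3, stride=1, gap=0):
        """Yield lists of frame indices for successive input windows.

        stride=1 slides the window frame by frame; stride=window_size-1 makes
        the last frame of one window the first frame of the next; setting
        stride=window_size with gap > 0 skips frames between windows so that
        successive windows do not overlap.
        """
        start = 0
        while start + window_size <= num_frames:
            yield list(range(start, start + window_size))
            start += stride + gap

    # Frame-by-frame sliding yields windows such as [10, 11, 12], [11, 12, 13], ...
    # once iteration reaches frame number 10 of the video stream.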

[0075] Training of the one or more machine-learning models 702 can be performed using various types of encoders, feature extractors, and task-specific decoders to form task-specific networks for desired output types of the one or more machine-learning models 702. FIG. 4 depicts a flow diagram 400 for joint multi-task training of machine-learning models used to detect surgical states and instruments in the surgical data, which can be used to train the one or more machine-learning models 702. The one or more machine-learning models 702 can be trained based on a training dataset including a plurality of temporally aligned annotated data streams, such as the temporal annotations 306, spatial annotations 308, and sensor annotations 310, where the input window 320 represents a temporal alignment of multiple frames of training data.

[0076] One or more feature encoders can be used to predict features from the surgical data for the procedure. In the example of FIG. 4, the input window 320 provides input to two feature encoders 402, 412. The feature encoders 402, 412 can be based on one or more artificial neural networks, such as a convolutional neural network (CNN), a recurrent neural network (RNN), a feature pyramid network (FPN), a transformer network, or any other type of neural network or a combination thereof. The feature encoders 402, 412 can use a known supervised, self-supervised, or unsupervised technique (e.g., an autoencoder) to learn efficient data “coding” of the surgical data. The “coding” maps input data to one or more feature spaces (e.g., feature spaces 404, 405, 414), which can be used by feature decoders to perform semantic analysis of the surgical data. In one or more aspects, a task-specific decoder 406 can predict instruments being used at an instance in the surgical data based on the predicted features. Additionally, a task-specific decoder 416 can predict the surgical states in the surgical data based on the predicted features. The feature encoders 402, 412 can be pre-trained to extract one or more features with task-agnostic input to form feature vectors or feature pyramids as the feature spaces 404, 405, 414.

[0077] In the example of FIG. 4, an instrument network 401 can be formed by the feature encoder 402, feature space 404, and task-specific decoder 406 for instrument detection 408. The instrument detection 408 can include presence or localization determinations. The instrument detection 408 can detect and/or track one or more surgical instruments at least partially depicted in one or more images of a video stream from the input window 320. The instrument detection 408 may be defined with respect to identifying one or more surgical instruments being present along with position, orientation, and/or movement. Further in FIG. 4, a state network 411 can be formed by the feature encoder 412, feature space 414, and task-specific decoder 416 to determine state detection 418. The state detection 418 can identify a state of a surgical procedure based on the input window 320 and relationships learned over a training period. In one or more aspects, machine learning is further enhanced by establishing relationships between the feature space 405 of the instrument network 401 and the task-specific decoder 416 of the state network 411. The feature space 414 of the state network 411 can also provide input to the task-specific decoder 406 of the instrument network 401. The joint training of multiple tasks simultaneously can be used to find correlations between tasks to improve model accuracy. Training can be supported using a fully labeled training dataset or a partially labeled training dataset. Unlabeled training data can be used with partial loss functions, for example.
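The following is a minimal PyTorch-style sketch, not the claimed implementation, of joint multi-task training in the spirit of FIG. 4: two encoder/decoder branches exchange their feature spaces before decoding, and label masks implement a partial loss over incompletely labeled training data. All class counts, layer sizes, and the masking scheme are illustrative assumptions.

    import torch
    import torch.nn as nn

    class Branch(nn.Module):
        """One task branch: a feature encoder plus a task-specific decoder
        that also consumes features shared from the other branch."""
        def __init__(self, in_dim=512, feat_dim=256, out_dim=10):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(in_dim, feat_dim), nn.ReLU())
            self.decoder = nn.Linear(feat_dim * 2, out_dim)

        def encode(self, x):
            return self.encoder(x)

        def decode(self, own_feat, shared_feat):
            return self.decoder(torch.cat([own_feat, shared_feat], dim=-1))

    instrument_net = Branch(out_dim=8)   # assumed number of instrument classes
    state_net = Branch(out_dim=12)       # assumed number of surgical states
    optimizer = torch.optim.Adam(
        list(instrument_net.parameters()) + list(state_net.parameters()), lr=1e-4)
    criterion = nn.CrossEntropyLoss(reduction="none")

    def train_step(window_feats, instr_labels, state_labels, instr_mask, state_mask):
        """One joint update; masks are 1.0 where annotations exist and 0.0
        otherwise, yielding a partial loss over incompletely labeled windows."""
        f_instr = instrument_net.encode(window_feats)
        f_state = state_net.encode(window_feats)
        instr_logits = instrument_net.decode(f_instr, f_state)   # shared features
        state_logits = state_net.decode(f_state, f_instr)        # shared features
        loss_instr = (criterion(instr_logits, instr_labels) * instr_mask).sum() \
            / instr_mask.sum().clamp(min=1)
        loss_state = (criterion(state_logits, state_labels) * state_mask).sum() \
            / state_mask.sum().clamp(min=1)
        loss = loss_instr + loss_state
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()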

[0078] The instrument network 401 can be trained to identify groups of keypoints associated with surgical instruments. Relationships of keypoints with respect to each other can form a keypoint-skeleton that associates the sequence of keypoints and their relative positioning, using localizations to detect and/or track movement of the surgical instruments. Thus, one or more surgical instruments can be identified based on a localization using learned groupings of keypoints per surgical instrument.
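As a simple illustration of grouping keypoints into a keypoint-skeleton, the sketch below represents an instrument as named keypoints connected by edges and derives a coarse orientation from the shaft keypoints. The keypoint names follow FIG. 5, while the dictionary layout and the angle computation are assumptions made for illustration only.

    import math

    SKELETON_EDGES = [("shaft_start", "shaft_end"),
                      ("shaft_end", "joint"),
                      ("joint", "tip")]

    def instrument_pose(keypoints):
        """keypoints: dict mapping a keypoint name to its (x, y) image location;
        missing or occluded keypoints are simply absent from the dict."""
        start = keypoints.get("shaft_start")
        end = keypoints.get("shaft_end")
        angle = None
        if start is not None and end is not None:
            angle = math.atan2(end[1] - start[1], end[0] - start[0])
        return {"orientation": angle,
                "tip": keypoints.get("tip"),
                "edges_present": [(a, b) for a, b in SKELETON_EDGES
                                  if a in keypoints and b in keypoints]}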

[0079] Additional task-specific networks can also be added to the ensemble with joint multi-task learning. For example, the one or more machine-learning models 702 of FIG. 7 can be trained to predict one or more locations of one or more surgical instruments relative to one or more anatomical structures associated with a particular type of the surgical procedure. Further examples can include computer vision modeling in combination with one or more artificial neural networks, such as encoders, Recurrent Neural Networks (RNNs, e.g., LSTM, GRU, etc.), CNNs, Temporal Convolutional Neural Networks (TCNs), decoders, Transformers, other deep neural networks, etc. For example, an encoder can be trained using weak labels (such as lines, ellipses, local heatmaps, or rectangles) or full labels (segmentation masks, heatmaps) to predict (i.e., detect and identify) features in the surgical data. In some cases, full labels can be automatically generated from weak labels by using trained machine-learning models. Encoders can be implemented using architectures, such as ResNet, VGG, or other such neural network architectures. During training, encoders can be trained using input windows 320 that include images 302 that are annotated with the labels (weak or full).

[0080] Extracted features from the input window 320 can include one or more labels assigned to one or more portions of the surgical data in the input window 320. Other types of localizations that can be predicted by task-specific decoders can include anatomical localization that provides locations, e.g., coordinates, heatmaps, bounding boxes, boundaries, masks, etc., of one or more anatomical structures identified in the input window 320. Anatomical structures that are identified can include organs, arteries, implants, surgical artifacts (e.g., staples, stitches, etc.), etc. Further yet, based on the type of surgical procedure being performed, one or more of the predicted anatomical structures can be identified as critical structures for the success of the procedure. The anatomical localization, in one or more aspects, can be limited to the spatial domain (e.g., bounding box, heatmap, segmentation mask) of the critical structures but uses temporal annotations 306 of FIG. 3 to enhance temporal consistency of the predictions. The temporal annotations 306 can be based on sensor measurements, acoustic information, and other such data that is captured at the time of capturing the respective images 302 of FIG. 3.

[0081] Temporal information that is provided by state information can be used to refine confidence of the instrument prediction or anatomy prediction in one or more aspects. In one or more aspects, the temporal information can be fused with a feature space, and the resulting fused information can be used by a decoder to output instrument localization and/or anatomical localization, for example. Other visual or temporal cues may also be fused.

[0082] Feature fusion can be based on transform-domain fusion algorithms to implement a fusion neural network (FNN). The FNN can fuse images, features, and/or sensor data. For example, an initial number of layers in the FNN extract salient features from the temporal information output by the first model and the feature space. Further, the extracted features are fused by an appropriate fusion rule (e.g., elementwise-max, elementwise-min, elementwise-mean, etc.) or a more complex learning-based neural network module designed to learn to weight and fuse input data (e.g., using attention modules). The fused features can be reconstructed by subsequent layers of the FNN to produce input data, such as an informative fusion image, for the decoder to analyze. Other techniques for fusing the features can be used in other aspects.
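A minimal sketch of the elementwise fusion rules mentioned above is shown below, assuming two equally shaped feature tensors; a learned attention-based fusion module would take the place of this simple rule in the more complex variants described above.

    import torch

    def fuse_features(feat_a, feat_b, rule="mean"):
        """Fuse two equally shaped feature maps before decoding."""
        if rule == "max":
            return torch.maximum(feat_a, feat_b)
        if rule == "min":
            return torch.minimum(feat_a, feat_b)
        if rule == "mean":
            return (feat_a + feat_b) / 2.0
        raise ValueError(f"unknown fusion rule: {rule}")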

[0083] In one or more aspects, the instrument detection 408 and/or state detection 418 can further include a measure of the uncertainty of the processing, i.e., a level of confidence that the data points resulting from the processing are correct. This measure represents a confidence score of the outputs, i.e., a measure of the reliability of a prediction. For example, a confidence score of 95 percent or 0.95 means that there is a probability of at least 95 percent that the prediction is reliable. The confidence score can be computed as a distance transform from the central position to attenuate predictions near the boundaries. The confidence score can also be computed using a probabilistic formulation (e.g., Bayesian deep learning, probabilistic outputs like softmax or sigmoid functions, etc.). The confidence scores for various predictions can be scaled and/or normalized within a certain range, e.g., [0, 1].
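The sketch below illustrates, under assumed shapes, two of the confidence formulations mentioned above: a softmax probability used as a confidence score, and a distance-transform attenuation of per-pixel confidence near region boundaries, with values normalized to the range [0, 1].

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def softmax_confidence(logits):
        """Confidence of the predicted class as its softmax probability."""
        exp = np.exp(logits - logits.max())
        probs = exp / exp.sum()
        return float(probs.max())

    def boundary_attenuated_confidence(mask, conf_map):
        """Attenuate a per-pixel confidence map near the boundary of a binary
        region mask using the distance from each pixel to the background."""
        dist = distance_transform_edt(mask)   # 0 outside, grows toward center
        if dist.max() > 0:
            dist = dist / dist.max()          # normalize to [0, 1]
        return conf_map * dist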

[0084] FIG. 5 depicts an example of detecting and tracking surgical instruments according to one or more aspects. An image 500 can be captured from a frame in the images 302 of FIG. 3. Various annotations depicted in FIG. 5 are examples of spatial annotations 308 of FIG. 3. The annotations can be added as part of a training dataset to train the one or more machine-learning models 702 to predict one or more locations of the one or more surgical instruments 511, 531, 551 relative to one or more anatomical structures 502 associated with a particular type of surgical procedure. For example, a bounding box 510 can define a region of the image 500 where surgical instrument 511 is positioned. FIG. 5 is an example of multi-feature localization that can use a combination of bounding boxes, keypoints, and labels. Features of the surgical instrument 511 can include keypoints, such as a shaft start 512, a shaft end 514, a joint 516, a joint 518, and a tip 520. A bounding box 530 can define a region of the image 500 where surgical instrument 531 is positioned. Features of the surgical instrument 531 can include keypoints, such as a shaft start 532, a shaft end 534, a joint 536, a tip 538, and a tip 540. A bounding box 550 can define a region of the image 500 where surgical instrument 551 is positioned. Features of the surgical instrument 551 can include keypoints, such as a shaft start 552, a shaft end 554, a joint 556, a joint 558, and a tip 560. Different instruments can have specific keypoints and geometric relationships that can be used to identify the instruments and the position/orientation of the instruments. As the surgical instruments 511, 531, 551 are used, one or more keypoints can become partially or fully obstructed or may otherwise no longer be visible. The surgical instruments 511, 531, 551, bounding boxes 510, 530, 550, labels, and keypoints depicted in FIG. 5 are examples and other variations are possible to cover a wide variety of configurations.

[0085] The machine-learning models as described herein can provide a confidence score with results to allow flexibility in determining likely identification and localization information when the keypoints are not readily identifiable. When the one or more machine-learning models 702 are used to predict one or more locations of the one or more surgical instruments 511, 531, 551, the graphical representation of FIG. 5 need not be displayed directly but can be used for tracking and other types of outputs, such as heat map overlays that align with the one or more surgical instruments 511, 531, 551 or a portion thereof. Overlays and visualizations can be generated by the augmentor 175 of FIG. 1. Motion profiles of the one or more surgical instruments 511, 531, 551 can also be determined, tracked, and displayed using techniques further described herein.

[0086] “Keypoints” of surgical instruments can be defined within a bounded region between a starting keypoint and an end point, such as a tip of the surgical instrument. Keypoints may also represent pivot points, such as joints, shaft starting and end points, and other such features depending upon the physical structure and function of each surgical instrument. Characteristics, such as having two tips coupled to the same joint, a sequence of joints, and other such features, can be used to collectively identify the surgical instruments. Grouping keypoints can form keypoint-skeletons that collectively identify surgical instruments along with the location, orientation, and relative position with respect to other structures.

[0087] “Critical anatomical structures” can be specific to the type of surgical procedure being performed and identified automatically. Additionally, a surgeon or any other user can configure the system 100 to identify particular anatomical structures as critical for a particular patient. The selected anatomical structures are critical to the success of the surgical procedure, such as anatomical landmarks (e.g., Calot triangle, Angle of His, etc.) that need to be identified during the procedure or those resulting from a previous surgical task or procedure (e.g., stapled or sutured tissue, clips, etc.). The system 100 can access a plurality of surgical objectives associated with the surgical procedure and correlate the surgical objectives with the one or more surgical instruments and the state of the surgical procedure. Observations relative to critical anatomical structures and surgical objectives can be used to control alert generation.

[0088] FIG. 6 depicts a flow diagram 600 of further training of machine-learning models by applying a previously learned instrument network (e.g., using a student-teacher architecture) according to one or more aspects. To further refine the training of the one or more machine-learning models 702 of FIG. 7, the flow diagram 600 uses a similar structure as the flow diagram 400 of FIG. 4. In the example of FIG. 6, the input window 320 provides input to two feature encoders 602, 612. The feature encoders 602, 612 can be based on one or more artificial neural networks, such as a convolutional neural network (CNN), a recurrent neural network (RNN), a feature pyramid network (FPN), a transformer network, or any other type of neural network or a combination thereof. The feature encoders 602, 612 can use a known supervised, self-supervised, or unsupervised technique (e.g., an autoencoder) to learn efficient data “coding” of the surgical data. The “coding” maps input data to one or more feature spaces (e.g., feature spaces 604, 605, 614), which can be used by feature decoders to perform semantic analysis of the surgical data. In one or more aspects, a task-specific decoder 606 can predict instruments being used at an instance in the surgical data based on the predicted features. Additionally, a task-specific decoder 616 can predict the surgical states in the surgical data based on the predicted features. The feature encoders 602, 612 can be pre-trained to extract one or more features with task-agnostic input to form feature vectors or feature pyramids as the feature spaces 604, 605, 614.

[0089] In the example of FIG. 6, an instrument network 601 can be formed by the feature encoder 602, feature space 604, and task-specific decoder 606 to determine instrument detection 608. The instrument detection 608 can detect and track one or more surgical instruments at least partially depicted in one or more images of a video stream from the input window 320 through determining presence, localization information, and/or time sequencing. The instrument detection 608 may be defined with respect to identifying one or more surgical instruments being present along with position, orientation, and/or movement. Further in FIG. 6, a state network 611 can be formed by the feature encoder 612, feature space 614, and task-specific decoder 616 to determine state detection 618. The state detection 618 can identify a state of a surgical procedure based on the input window 320 and relationships learned over a training period. In one or more aspects, machine learning is further enhanced by establishing relationships between the feature space 605 of the instrument network 601 and the task-specific decoder 616 of the state network 611. The feature space 614 of the state network 611 can also provide input to the task-specific decoder 606 of the instrument network 601.

[0090] To further enhance machine learning, one or more predictions of a previously learned instrument network 401 can be applied to the training dataset of the input window 320. For example, a frame 620 of the input window 320 representing a subset of the training dataset 301 of FIG. 3 can be extracted and provided to a feature encoder 622 and task-specific decoder 624 that model the previously learned instrument network 401 after training is performed in the flow diagram 400. The encoder 622 can be a trained combination of the encoders 402, 412 of FIG. 4. The task-specific decoder 624 can be a trained combination of the decoders 406, 416 of FIG. 4. An instrument detection output 628 of the previously learned instrument network 401 can be correlated with spatial annotations 308 of the input window 320. For example, where spatial data 307 includes unlabeled data 303, the instrument detection output 628 can be used to generate annotations 630 for the unlabeled data 303. Accordingly, by first using the previously learned instrument network 401 to generate annotations 630, the training dataset associated with the input window 320 can be updated such that training of the instrument network 601 and the state network 611 can have improved performance. The instrument detection output 628 can also be correlated to the instrument detection 608, for example, to compare performance results. Using the previously learned instrument network 401 with the flow diagram 600 can also compress model size for faster inference performance and improve overall accuracy by training ensembles.
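The following sketch illustrates one way the previously learned instrument network could be applied as a teacher to generate annotations 630 for unlabeled frames; the encoder/decoder callables and the confidence threshold are assumptions for illustration, not the claimed training procedure.

    import torch

    @torch.no_grad()
    def pseudo_label_window(teacher_encoder, teacher_decoder, frames, conf_threshold=0.9):
        """frames: tensor of unlabeled frames from the input window 320.
        Returns predicted labels and a mask selecting confident predictions."""
        feats = teacher_encoder(frames)
        logits = teacher_decoder(feats)
        probs = torch.softmax(logits, dim=-1)
        conf, labels = probs.max(dim=-1)
        keep = conf >= conf_threshold   # only trust high-confidence outputs
        return labels, keep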

[0091] FIG. 7 depicts a flow diagram 700 of automatic prediction of surgical states and instruments in surgical data using one or more machine-learning models according to one or more aspects. The one or more machine-learning models 702 can include detection ensembles 704 that each include a pairing of a feature encoder 712 and a task-specific decoder 716 after training by the flow diagram 400 of FIG. 4 and/or flow diagram 600 of FIG. 6. An input window 720 of input data can include one or more inputs, such as video data 705 from a video stream 703, spatial data 707, and sensor data 709 temporally associated with the video stream 703 of a surgical procedure. The one or more machine-learning models 702 can predict a state of the surgical procedure based on accessing the input window 720 of input data. The one or more machine-learning models 702 can also track one or more surgical instruments (e.g., surgical instruments 511, 531, 551 of FIG. 5) at least partially depicted in the video stream 703 based on the input window 720. The feature encoders 712 and task-specific decoders 714 can be trained as an ensemble to detect the state and the one or more surgical instruments by sharing extracted features associated with the state and the one or more surgical instruments between the feature encoders 712 and task-specific decoders 714. Detecting, by the one or more machine-learning models, one or more surgical instruments can include joint detection of the one or more surgical instruments and a surgical workflow at one or more levels of granularity, where the one or more levels of granularity include one or more of event detection, action detection, step detection, surgical objective detection, and/or phase detection.

[0092] A state indicator 718 and one or more surgical instrument indicators 708 can be output and temporally correlated with the video stream 703. For example, the state indicator 718 can be the state prediction, which can be linked to one or more frames 710 of the video stream 703. The one or more surgical instrument indicators 708 can be the tracked instrument information of the one or more surgical instruments and can be linked to one or more frames 710 of the video stream 703. Linking can include the addition of metadata, tags, overlays, or separately tracked relationship information. For example, when performing real-time inferences, only select frames 710 of the video stream 703 may be updated with or linked to the state indicator 718 and/or the one or more surgical instrument indicators 708. The update rate may depend on the processing capacity of the system 100 of FIG. 1. The output can be in other forms, such as a notification, a request, a visualization, a scheduling update, or other type of alert.

[0093] During an inference phase, the one or more machine-learning models 702 can receive as input live surgical data 300 that has not been pre-processed. The one or more machine-learning models 702, in the inference phase, generate the predictions. The one or more machine-learning models 702 can also output corresponding confidence scores associated with the predictions.

[0094] The outputs of the one or more machine-learning models 702 can be used by the output generator 160 to provide augmented visualization via the augmented reality devices 180. The augmented visualization can include the graphical overlays being overlaid on the corresponding features (anatomical structure, surgical instrument, etc.) in the image(s) 302.

[0095] The output generator 160 can also provide user feedback via the alert output system 170 in some aspects. The user feedback can include using graphical overlays to highlight one or more portions of the image(s) 302 to depict proximity between the surgical instrument(s) and anatomical structure(s). Alternatively, or in addition, the user feedback can be displayed in any other manner, such as a message, an icon, etc., being overlaid on the image(s) 302.

[0096] In some aspects, to facilitate real-time performance, the input window 320 can be analyzed at a predetermined frequency, such as 5 times per second, 3 times per second, 10 times per second, etc. The analysis can result in identification of locations of anatomical structures and surgical instruments in the images 302 that are in the input window 320. It can be appreciated that the video of the surgical procedure includes images 302 that are between two successive input windows 320. For example, if the video is captured at 60 frames per second, and if the input window 320 includes 5 frames, and if the input window 320 is analyzed 5 times per second, then a total of 25 frames from the captured 60 frames are analyzed. The remaining 35 frames are in between two successive input windows 320. It is understood that the capture speed, input window frequency, and other parameters can vary from one aspect to another, and that the above numbers are examples.

[0097] For the frames, e.g., including images 302, between two successive input windows 320, the locations of the anatomical structures and surgical instruments can be predicted based on the locations predicted in the most recent input window 320. For example, a movement vector of the surgical instrument can be computed based on the changes in the location of the surgical instrument in the frames in the prior input window 320. The movement vector can be computed using a machine learning model, such as a deep neural network. The movement vector is used to predict the location of the surgical instrument in the subsequent frames after the input window 320, until a next input window 320 is analyzed.

[0098] The location of structure(s) predicted by the one or more machine-learning models 702 can also be predicted in the frames between two successive input windows 320 in the same manner. Graphical overlays that are used to overlay the images 302 to represent predicted features (e.g., surgical instruments, anatomical structures, etc.) are accordingly adjusted, if required, based on the predicted locations. Accordingly, a smooth visualization, in real time, is provided to the user with fewer computing resources being used. In some aspects, the graphical overlays can be configured to be switched off by the user, for example, the surgeon, and the system works without overlays, only generating the overlays and/or other types of user feedback when an alert is to be provided (e.g., an instrument within a predetermined vicinity of an anatomical structure).

[0099] Aspects of the technical solutions described herein can improve surgical procedures by improving the safety of the procedures. Further, the technical solutions described herein facilitate improvements to computing technology, particularly computing techniques used during a surgical procedure. Aspects of the technical solutions described herein facilitate one or more machine-learning models, such as computer vision models, to process images obtained from a live video feed of the surgical procedure in real-time using spatio-temporal information. The machine-learning models use techniques such as neural networks to draw on information from the live video feed and (if available) a robotic sensor platform to predict one or more features, such as anatomical structures and surgical instruments, in an input window of the live video feed, and to further refine the predictions using additional machine-learning models that can predict a state of the surgical procedure.
The machine-learning models can be trained to identify the surgical state(s) of the procedure and instruments in the field of view by learning from raw image data and instrument markers (bounding boxes, lines, key points, etc.). In a robotic procedure, the computer vision models can also accept sensor information (e.g., instruments enabled, mounted, etc.) to improve the predictions. Computer vision models that predict instruments and critical anatomical structures use temporal information from the state prediction models to improve the confidence of the predictions in real-time.
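As a non-limiting illustration of the interpolation described in paragraphs [0097] and [0098], the sketch below estimates a per-frame movement vector from the instrument locations in the most recent input window and extrapolates locations for the frames between input windows. A constant-velocity estimate is assumed here in place of the deep-neural-network prediction mentioned above.

    import numpy as np

    def movement_vector(window_positions):
        """window_positions: (x, y) instrument locations for the frames of the
        most recently analyzed input window."""
        pos = np.asarray(window_positions, dtype=float)
        return (pos[-1] - pos[0]) / max(len(pos) - 1, 1)   # per-frame delta

    def extrapolate(last_position, vector, frames_ahead):
        """Predicted location a given number of frames after the window."""
        return np.asarray(last_position, dtype=float) + vector * frames_ahead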

[0100] The predictions and the corresponding confidence scores can be used to generate and display graphical overlays to the surgeon and/or other users in an augmented visualization of the surgical view. The graphical overlays can mark critical anatomical structures, surgical instruments, surgical staples, scar tissue, results of previous surgical actions, etc. The graphical overlays can further show a relationship between the surgical instrument(s) and one or more anatomical structures in the surgical view and thus, guide the surgeon and other users during the surgery. The graphical overlays are adjusted according to the user’s preferences and/or according to the confidence scores of the predictions. Aspects of the technical solutions described herein provide a practical application in surgical procedures.

[0101] Further yet, aspects of the technical solutions described herein address technical challenges of predicting complex features in a live video feed of a surgical view in real-time. The technical challenges are addressed by using a combination of various machine learning techniques to analyze multiple images in the video feed. Further yet, to address the technical challenge of real-time analysis and augmented visualization of the surgical view, aspects of the technical solutions described herein predict the present state of the surgical view at a constant frame rate and update the present state using the machine-learning models at a predetermined frame rate.

[0102] It should be noted that although some of the drawings depict endoscopic videos being analyzed, the technical solutions described herein can be applied to analyze video and image data captured by cameras that are not endoscopic (i.e., cameras external to the patient’s body) when performing open surgeries (i.e., not laparoscopic surgeries). For example, the video and image data can be captured by cameras that are mounted on one or more personnel in the operating room, e.g., a surgeon. Alternatively, or in addition, the cameras can be mounted on surgical instruments, walls, or other locations in the operating room.

[0103] Turning now to FIG. 8, a computer system 800 is generally shown in accordance with an aspect. The computer system 800 can be an electronic computer framework comprising and/or employing any number and combination of computing devices and networks utilizing various communication technologies, as described herein. The computer system 800 can be easily scalable, extensible, and modular, with the ability to change to different services or reconfigure some features independently of others. The computer system 800 may be, for example, a server, desktop computer, laptop computer, tablet computer, or smartphone. In some examples, computer system 800 may be a cloud computing node. Computer system 800 may be described in the general context of computer executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, and so on that perform particular tasks or implement particular abstract data types. Computer system 800 may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media, including memory storage devices.

[0104] As shown in FIG. 8, the computer system 800 has one or more central processing units (CPU(s)) 801a, 801b, 801c, etc. (collectively or generically referred to as processor(s) 801). The processors 801 can be a single-core processor, multi-core processor, computing cluster, or any number of other configurations. The processors 801, also referred to as processing circuits, are coupled via a system bus 802 to a system memory 803 and various other components. The system memory 803 can include one or more memory devices, such as a read-only memory (ROM) 804 and a random access memory (RAM) 805. The ROM 804 is coupled to the system bus 802 and may include a basic input/output system (BIOS), which controls certain basic functions of the computer system 800. The RAM 805 is read-write memory coupled to the system bus 802 for use by the processors 801. The system memory 803 provides temporary memory space for operations of said instructions during operation. The system memory 803 can include random access memory (RAM), read-only memory, flash memory, or any other suitable memory systems.

[0105] The computer system 800 comprises an input/output (I/O) adapter 806 and a communications adapter 807 coupled to the system bus 802. The I/O adapter 806 may be a small computer system interface (SCSI) adapter that communicates with a hard disk 808 and/or any other similar component. The I/O adapter 806 and the hard disk 808 are collectively referred to herein as a mass storage 810.

[0106] Software 811 for execution on the computer system 800 may be stored in the mass storage 810. The mass storage 810 is an example of a tangible storage medium readable by the processors 801, where the software 811 is stored as instructions for execution by the processors 801 to cause the computer system 800 to operate, such as is described hereinbelow with respect to the various Figures. Examples of a computer program product and the execution of such instructions are discussed herein in more detail. The communications adapter 807 interconnects the system bus 802 with a network 812, which may be an outside network, enabling the computer system 800 to communicate with other such systems. In one aspect, a portion of the system memory 803 and the mass storage 810 collectively store an operating system, which may be any appropriate operating system to coordinate the functions of the various components shown in FIG. 8.

[0107] Additional input/output devices are shown as connected to the system bus 802 via a display adapter 815 and an interface adapter 816. In one aspect, the adapters 806, 807, 815, and 816 may be connected to one or more I/O buses that are connected to the system bus 802 via an intermediate bus bridge (not shown). A display 819 (e.g., a screen or a display monitor) is connected to the system bus 802 by the display adapter 815, which may include a graphics controller to improve the performance of graphics-intensive applications and a video controller. A keyboard, a mouse, a touchscreen, one or more buttons, a speaker, etc., can be interconnected to the system bus 802 via the interface adapter 816, which may include, for example, a Super I/O chip integrating multiple device adapters into a single integrated circuit. Suitable I/O buses for connecting peripheral devices such as hard disk controllers, network adapters, and graphics adapters typically include common protocols, such as the Peripheral Component Interconnect (PCI). Thus, as configured in FIG. 8, the computer system 800 includes processing capability in the form of the processors 801, storage capability including the system memory 803 and the mass storage 810, input means such as the buttons and touchscreen, and output capability including the speaker 823 and the display 819.

[0108] In some aspects, the communications adapter 807 can transmit data using any suitable interface or protocol, such as the internet small computer system interface, among others. The network 812 may be a cellular network, a radio network, a wide area network (WAN), a local area network (LAN), or the Internet, among others. An external computing device may connect to the computer system 800 through the network 812. In some examples, an external computing device may be an external web server or a cloud computing node.

[0109] It is to be understood that the block diagram of FIG. 8 is not intended to indicate that the computer system 800 is to include all of the components shown in FIG. 8. Rather, the computer system 800 can include any appropriate fewer or additional components not illustrated in FIG. 8 (e.g., additional memory components, embedded controllers, modules, additional network interfaces, etc.). Further, the aspects described herein with respect to computer system 800 may be implemented with any appropriate logic, wherein the logic, as referred to herein, can include any suitable hardware (e.g., a processor, an embedded controller, or an application-specific integrated circuit, among others), software (e.g., an application, among others), firmware, or any suitable combination of hardware, software, and firmware, in various aspects.

[0110] FIG. 9 depicts a surgical procedure system 900 in accordance with one or more aspects. The example of FIG. 9 depicts a surgical procedure support system 902 configured to communicate with a surgical procedure scheduling system 930 through a network 920. The surgical procedure support system 902 can include or may be coupled to the system 100 of FIG. 1. The surgical procedure support system 902 can acquire image data, such as images 302 of FIG. 3, using one or more cameras 904. The surgical procedure support system 902 can also interface with a plurality of sensors 906 and effectors 908. The sensors 906 can produce sensor data 309 of FIG. 3 and/or spatial data 307 of FIG. 3. The sensors 906 may be associated with surgical support equipment and/or patient monitoring. The effectors 908 can be robotic components or other equipment controllable through the surgical procedure support system 902. The surgical procedure support system 902 can also interact with one or more user interfaces 910, such as various input and/or output devices. The surgical procedure support system 902 can store, access, and/or update surgical data 914 associated with a training dataset and/or live data as a surgical procedure is being performed. The surgical procedure support system 902 can store, access, and/or update surgical objectives 916 to assist in training and guidance for one or more surgical procedures.

[0111] The surgical procedure scheduling system 930 can access and/or modify scheduling data 932 used to track planned surgical procedures. The scheduling data 932 can be used to schedule physical resources and/or human resources to perform planned surgical procedures. Based on the surgical state as predicted by the one or more machine-learning models 702 of FIG. 7 and a current operational time, the surgical procedure support system 902 can estimate an expected time for the end of the surgical procedure. This can be based on previously observed similarly complex cases with records in the surgical data 914. A change in a predicted end of the surgical procedure can be used to inform the surgical procedure scheduling system 930 to prepare the next patient, which may be identified in a record of the scheduling data 932. The surgical procedure support system 902 can send an alert to the surgical procedure scheduling system 930 that triggers a scheduling update associated with a later surgical procedure. The change in scheduling can be captured in the scheduling data 932. Predicting an end time of the surgical procedure can increase efficiency in operating rooms that run parallel sessions, as resources can be distributed between the operating rooms. Requests to be in an operating room can be transmitted as one or more notifications 934 based on the scheduling data 932 and the predicted surgical state.
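A simple sketch of estimating an expected end time from the predicted surgical state is shown below, assuming historical median durations per remaining state are available from records such as the surgical data 914; the data layout and field names are illustrative assumptions only.

    from datetime import datetime, timedelta

    def estimate_end_time(states_remaining, historical_minutes, now=None):
        """historical_minutes: dict mapping a state name to the median duration
        (in minutes) observed in previously recorded, similarly complex cases."""
        now = now or datetime.now()
        remaining = sum(historical_minutes.get(state, 0.0) for state in states_remaining)
        return now + timedelta(minutes=remaining)

    # A scheduling alert could be triggered when a new estimate differs from the
    # previously communicated estimate by more than an assumed tolerance.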

[0112] As surgical states and steps are completed, progress can be tracked in the surgical data 914 and status can be displayed through the user interfaces 910. Status information may also be reported to other systems through the notifications 934 as surgical states and steps are completed or if any issues are observed, such as complications.

[0113] FIG. 10 depicts an example of a CAS system 1000 according to one or more aspects. The CAS system 1000 includes at least a computing system 1002, a video recording system 1004, and a surgical instrumentation system 1006. The computing system 1002 can include one or more instances of the computer system 800 of FIG. 8. The CAS system 1000 can include aspects of the surgical procedure system 900 of FIG. 9 and may implement a portion or all of the system 100 of FIG. 1.

[0114] With respect to FIG. 10, an actor 1012 can be medical personnel that uses the CAS system 1000 to perform a surgical procedure on a patient 1010. Medical personnel can be a surgeon, assistant, nurse, administrator, or any other actor that interacts with the CAS system 1000 in a surgical environment. The surgical procedure can be any type of surgery, such as but not limited to cataract surgery, laparoscopic cholecystectomy, endoscopic endonasal transsphenoidal approach (eTSA) to resection of pituitary adenomas, or any other surgical procedure. In other examples, the actor 1012 can be a technician, an administrator, an engineer, or any other such personnel that interacts with the CAS system 1000. For example, the actor 1012 can record data from the CAS system 1000, configure/update one or more attributes of the CAS system 1000, review past performance of the CAS system 1000, repair the CAS system 1000, etc.

[0115] A surgical procedure can include multiple phases, and each phase can include one or more surgical actions. A “surgical action” can include an incision, a compression, a stapling, a clipping, a suturing, a cauterization, a sealing, or any other such actions performed to complete a phase in the surgical procedure. A “phase” represents a surgical event that is composed of a series of steps (e.g., closure). A “step” refers to the completion of a named surgical objective (e.g., hemostasis). During each step, certain surgical instruments 1008 (e.g., forceps) are used to achieve a specific objective by performing one or more surgical actions.

[0116] The surgical instrumentation system 1006 can provide electrical energy to operate one or more surgical instruments 1008 to perform the surgical actions. The electrical energy triggers an activation in the surgical instrument 1008. The electrical energy can be provided in the form of an electrical current or an electrical voltage. The activation can cause a surgical action to be performed. The surgical instrumentation system 1006 can further include electrical energy sensors, electrical impedance sensors, force sensors, bubble and occlusion sensors, and various other types of sensors. The electrical energy sensors can measure and indicate an amount of electrical energy applied to one or more surgical instruments 1008 being used for the surgical procedure. The impedance sensors can indicate an amount of impedance measured by the surgical instruments 1008, for example, from the tissue being operated upon. The force sensors can indicate an amount of force being applied by the surgical instruments 1008. Measurements from various other sensors, such as position sensors, pressure sensors, and flow meters, can also be input. The sensors can include, for example, sensors 906 of FIG. 9. The one or more surgical instruments 511, 531, 551 of FIG. 5 are examples of the surgical instruments 1008. The surgical instrumentation system 1006 can control the surgical instruments 1008 using, for example, the effectors 908 of FIG. 9.

[0117] The video recording system 1004 includes one or more cameras 1005, such as operating room cameras, endoscopic cameras, etc. Cameras 1005 can include the cameras 904 of FIG. 9. The cameras 1005 capture video data of the surgical procedure being performed. The video recording system 1004 includes one or more video capture devices that can include cameras 1005 placed in the surgical room to capture events surrounding (i.e., outside) the patient being operated upon. The video recording system 1004 can further include cameras 1005 that are passed inside (e.g., endoscopic cameras) the patient to capture endoscopic data. The endoscopic data provides video and images of the surgical procedure (e.g., FIG. 13).

[0118] The computing system 1002 includes one or more memory devices, one or more processors, user interface devices, and other such computer components. The computing system 1002 can execute one or more computer executable instructions. The execution of the instructions facilitates the computing system 1002 to perform one or more methods, including those described herein. The computing system 1002 can communicate with other computing systems via a wired and/or a wireless network. In one or more examples, the computing system 1002 includes one or more trained machine learning models that can detect and/or predict features of/from the surgical procedure that is being performed or has been performed earlier. Examples of the machine learning models can include trained machine-learning models 130 of FIG. 1, machine-learning models 702 of FIG. 7, and other such models as described herein. Features can include structures, such as anatomical structures and surgical instruments 1008, in the surgical procedure. Features can further include events, such as phases and actions, in the surgical procedure. Features that are detected can further include features relating to the actor 1012 and/or patient 1010. Based on the detection, the computing system 1002, in one or more examples, can provide recommendations for subsequent actions to be taken by the actor 1012. Alternatively, or in addition, the computing system 1002 can provide one or more reports based on the detections. The detections by the machine learning models can be performed in an autonomous or semi-autonomous manner.

[0119] The machine learning models can include artificial neural networks, such as deep neural networks, convolutional neural networks, recurrent neural networks, encoders, decoders, or any other type of machine learning models. The machine learning models can be trained in a supervised, unsupervised, or hybrid manner. The machine learning models can be trained to perform detection and/or prediction using one or more types of data acquired by the CAS system 1000. For example, the machine learning models can use the video data captured via the video recording system 1004. Alternatively, or in addition, the machine learning models use the surgical instrumentation data from the surgical instrumentation system 1006. In yet other examples, the machine learning models use a combination of the video and the surgical instrumentation data.

[0120] Additionally, in some examples, the machine learning models can also use audio data captured during the surgical procedure. The audio data can include sounds emitted by the surgical instrumentation system 1006 while activating one or more surgical instruments 1008. Alternatively, or in addition, the audio data can include voice commands, snippets, or dialog from one or more actors 1012. The audio data can further include sounds made by the surgical instruments 1008 during their use.

[0121] In one or more examples, the machine learning models can detect surgical actions, surgical phases, anatomical structures, surgical instruments, and various other features from the data associated with a surgical procedure. The detection can be performed in real-time in some examples. Alternatively, or in addition, the computing system 1002 analyzes the surgical data, i.e., the various types of data captured during the surgical procedure, in an offline manner (e.g., post-surgery).

[0122] A technical challenge with using the CAS system 1000 is to determine how the surgical instruments 1008 interact with anatomy, and which particular movements have been performed by the surgical instruments 1008. For example, when performing surgical actions such as grasping, cutting, etc., the tips may be kept open or closed, leading to different outcomes. Identifying and further recording the particular motion profiles used during particular surgical procedures or surgical actions can be beneficial. For example, such information can be used when a similar case arises in the future. Such information can also be used for detailed post-operative analysis. For example, a motion profile used during several instances of a surgical procedure can be correlated with the outcomes. Additionally, the recorded motion profiles can be used to train students/interns and other personnel to improve their skills when performing such surgical procedures. Such information can also be used in real time, for example, to prevent potential damage to tissue with early-warning systems.

[0123] Detecting and recording trajectories and motion of the surgical instruments 1008 over time during the surgical procedure is a technical challenge. Technical solutions described herein address such technical challenges by detecting real-time granular information of the positions and orientations of the surgical instrument 1008 and generating a motion profile of each surgical instrument 1008. The motion profile of the surgical instrument 1008 can be generated using machine learning in one or more examples. Further, in one or more examples, a three-dimensional (3D) graph over time can be generated that can highlight the different motion profiles of a surgical instrument 1008 in 3D space over a period of time. In some aspects, the motion profiles can also track orientation of the surgical instrument 1008 during use. The 3D graph can include at least one anatomical structure in combination with the motion profile to assist in understanding location of the surgical instrument 1008.

[0124] FIG. 11 depicts an example of a 3D graph 1100 of motion profiles 1102 generated using one or more examples. The motion profiles 1102 of several surgical instruments 1008 used during the surgical procedure are generated, for example, by the computing system 1002 of FIG. 10. Each motion profile 1102 is depicted using a distinct visual attribute. As one example, color can be used to distinguish between each motion profile 1102. However, other attributes can be used in other examples, such as dashes, grayscale, thickness, etc. The 3D graph 1100 and associated motion profiles 1102 can be captured and recorded as part of operative notes in electronic medical records associated with the patient 1010.

[0125] In one or more examples, a surgical video 1104 of a surgery being performed can be played back in the background with the motion profiles 1102 being shown concurrently as an overlay on a display device, such as display 819 of FIG. 8. For example, the 3D graph 1100 can be generated as a 3D user interactive element that includes the video playback, with the motion profiles 1102 being displayed as the surgical instruments 1008 are used in the surgical video 1104. The video playback and the motion profile playback are temporally synchronized for generating such a view.

[0126] The 3D graph 1100 can be manipulated by the user to rotate, translate, zoom in, zoom out, etc. to get a different view of the motion profiles and/or the video playback. The 3D graph 1100 can be partitioned into a plurality of time slices 1106 that can assist with visualizing a time component and selecting frames to view for post-operative analysis. For instance, where a position or change in one or more of the motion profiles 1102 is of interest, a user can select a time slice 1106 at or before the time of interest to start video playback at a time where the event occurs in the surgical video 1104. The time slices 1106 can be displayed and selectable to change a point in time of video playback, and the motion profiles 1102 can extend through two or more of the time slices 1106. The motion profiles 1102 can appear as continuous positions through multiple time slices 1106. The time slices 1106 may be depicted using a perspective view of the 3D graph 1100 to illustrate a time dimension.
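The following sketch renders motion profiles over time as a 3D graph in the spirit of FIG. 11, with time on one axis; matplotlib is an assumed choice, and the data layout (one sequence of (x, y, t) samples per instrument) is illustrative only.

    import matplotlib.pyplot as plt

    def plot_motion_profiles(profiles):
        """profiles: dict mapping an instrument name to a list of (x, y, t)
        samples of that instrument's tracked position over time."""
        fig = plt.figure()
        ax = fig.add_subplot(projection="3d")
        for name, samples in profiles.items():
            xs, ys, ts = zip(*samples)
            ax.plot(xs, ys, ts, label=name)   # one distinct line per profile
        ax.set_xlabel("x (pixels)")
        ax.set_ylabel("y (pixels)")
        ax.set_zlabel("time (s)")
        ax.legend()
        return fig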

[0127] Further, the 3D graph 1100 can include different enhancements to depict one or more anatomical structures 1108 that are detected in the surgical video 1104. For example, in FIG. 11, which depicts a cataract surgery being performed, the center of a pupil is highlighted in purple as a detected anatomical structure 1108. It is understood that although a cataract surgery being performed is shown in FIG. 11, the technical solutions described herein can be used to generate motion profiles 1102 of surgical instruments 1008 during any other type of surgical procedure. A user interface that displays the 3D graph 1100 can include other elements such as an annotation of surgical phase 1110 and/or a legend 1112 that assists in identifying highlighted features in the 3D graph 1100. Motion profiles 1102 can be defined at a more granular level to track separate features of a same surgical instrument, such as multiple prongs of a gripping or cutting tool. In some aspects, the 3D graph 1100 can be rotated based on a user selection, and multiple instances of the 3D graph 1100 can be displayed concurrently.

[0128] FIG. 12 depicts an example of a process 1200 of using machine learning to generate motion profiles 1102 of surgical instruments 1008 according to one or more examples. The process 1200 can be performed by the CAS system 1000 of FIG. 10 or other systems as described herein.

[0129] At block 1202, input images 1204 from the surgical video 1104 are analyzed. The analysis can use, for example, the one or more machine-learning models 702 of FIG. 7. Alternatively, the analysis can be independent of the one or more machine-learning models 702 to estimate the anatomical structures and surgical instruments. For instance, the analysis can include normalizing one or more image attributes to adjust for lighting effects, camera position, or other such factors. At block 1206, segmentation can be performed on the input images 1204. For example, machine learning 1208, such as artificial neural networks (e.g., a convolutional neural network, a deep neural network, etc.), can be used to segment the input data to estimate the features. In addition to the video/images that are captured, instrument detection can also be based on sensor data from the surgical instrumentation system 1006. At block 1210, tool and anatomy location estimation are performed. The output of machine learning 1208 can include feature masks that identify likely features in the input images 1204. For example, machine learning 1208 can include classification tuned to multiple features, such as surgical instruments 1008, surgical structures (e.g., sutures), and anatomical structures 1108. Regions of estimated features 1212, such as background features 1214, surgical tool features 1216, surgical structure features 1218, and anatomical structure features 1220, 1222, can be tracked but need not be displayed. Further, based on the estimated features, i.e., anatomy and surgical instruments 1008, at block 1224 a visualization can be generated that includes the 3D graph 1100 with motion profiles 1102. It will be appreciated that the process 1200 can include additional steps beyond those depicted in the example of FIG. 12.
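As an illustration of blocks 1210 and 1224, the sketch below estimates a tool location from a per-frame segmentation mask and accumulates the locations into a motion profile; the tool class index and the centroid heuristic are assumptions made for illustration only.

    import numpy as np

    TOOL_CLASS = 2   # assumed index of the surgical-tool class in the mask

    def tool_centroid(segmentation_mask, tool_class=TOOL_CLASS):
        """segmentation_mask: 2D array of per-pixel class indices; returns the
        (x, y) centroid of tool pixels, or None if the tool is absent."""
        ys, xs = np.nonzero(segmentation_mask == tool_class)
        if xs.size == 0:
            return None
        return float(xs.mean()), float(ys.mean())

    def build_motion_profile(masks, fps=30.0):
        """Accumulate per-frame centroids into a list of (x, y, t) samples."""
        profile = []
        for i, mask in enumerate(masks):
            centroid = tool_centroid(mask)
            if centroid is not None:
                profile.append((centroid[0], centroid[1], i / fps))
        return profile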

[0130] FIG. 13 depicts an example of a machine learning model 1300 for detecting the features from the input visual data captured during the surgical procedure. The machine learning model 1300 can be used to perform segmentation as part of the machine learning 1208 of FIG. 12. In one or more examples, a neural network architecture, such as a variant of FIRNet, can be used for high-resolution segmentation of the features. The machine learning model 1300 can be a neural network trained using previously annotated training data that includes surgical video, such as other instances of surgical video 1104 from previously performed surgical procedures. For example, the machine learning model 1300 can be trained using thousands of images from different clinical setups. In some examples, post-processing filtering is applied to handle misclassification in isolated pixels.

[0131] As one example, the machine learning model 1300 can include a plurality of feature maps 1302, 1304, 1306 that vary in scale and depth to perform segmentation of features with multi-resolution segmentation. The first scale feature maps 1302 can receive an input frame 1301 from the input images 1204 of FIG. 12. Features of the first scale feature maps 1302 can be downsampled to produce the second scale feature maps 1304, and the second scale feature maps 1304 can be further downsampled to produce the third scale feature maps 1306. The first scale feature maps 1302 have a higher resolution than the second scale feature maps 1304, and the second scale feature maps 1304 have a higher resolution than the third scale feature maps 1306. The first scale feature maps 1302 can be convolved 1310 with the second scale feature maps 1304, the second scale feature maps 1304 can be convolved 1312 with the third scale maps 1306, and the first scale feature maps 1302 can be convolved 1314 with the third scale maps 1306. Intermediate results after performing convolutions with weighting and/or other operations at each of the first scale, second scale, and third scale can be recombined, for instance using upsampling. For example, the second scale feature maps 1304 can be convolved 1316 with the first scale feature maps 1302, the third scale feature maps 1306 can be convolved 1318 with the first scale feature maps 1302, and the third scale feature maps 1306 can be convolved 1320 with the second scale feature maps 1304. The resulting segmentation map 1322 can segment the input frame 1301 into classes or regions representing predicted features, such as regions of estimated features 1212 of FIG. 12. For example, multi-resolution segmentation performed by the machine learning model 1300 can produce the segmentation map 1322 of a plurality of regions of estimated features including at least a portion of the one or more surgical instruments 1008, for instance, as surgical tool features 1216. Detailed classification can more precisely identify specific surgical instruments 1008 based at least in part on the surgical tool features 1216. Changes in location of a feature, such as a tool tip, can be tracked over a period of time to produce the motion profiles 1102. Sensor data can also further assist in classifying various regions of the segmentation map 1322.
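A minimal PyTorch sketch of the multi-resolution fusion idea described for the machine learning model 1300 is shown below: feature maps at three scales are exchanged and fused back into the highest-resolution stream before producing a segmentation map. The channel counts, number of fusion steps, and class count are illustrative assumptions and do not reproduce the full architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MultiScaleSegmenter(nn.Module):
        def __init__(self, in_ch=3, ch=(16, 32, 64), num_classes=5):
            super().__init__()
            self.stem = nn.Conv2d(in_ch, ch[0], 3, padding=1)      # first scale
            self.down1 = nn.Conv2d(ch[0], ch[1], 3, stride=2, padding=1)
            self.down2 = nn.Conv2d(ch[1], ch[2], 3, stride=2, padding=1)
            # 1x1 convolutions used when exchanging features across scales.
            self.to_first = nn.ModuleList([nn.Conv2d(c, ch[0], 1) for c in ch[1:]])
            self.head = nn.Conv2d(ch[0], num_classes, 1)

        def forward(self, x):
            f0 = F.relu(self.stem(x))     # highest-resolution feature maps
            f1 = F.relu(self.down1(f0))   # second scale (downsampled)
            f2 = F.relu(self.down2(f1))   # third (lowest) scale
            size = f0.shape[-2:]
            # Upsample the lower-resolution maps and fuse into the first scale.
            f0 = f0 + F.interpolate(self.to_first[0](f1), size=size,
                                    mode="bilinear", align_corners=False)
            f0 = f0 + F.interpolate(self.to_first[1](f2), size=size,
                                    mode="bilinear", align_corners=False)
            return self.head(f0)          # per-pixel class logits

    # Usage sketch: logits = MultiScaleSegmenter()(frames); labels = logits.argmax(1)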

[0132] The technical solutions described herein are not limited to any particular type of instrument, and the motion profiles can be generated for any type of surgical instrument 1008. Generating, recording, and displaying the motion profiles allows for a better understanding of tool positioning with respect to particular anatomical structures, such as the center of the pupil. An intuitive visualization tool, such as the one provided by the technical solutions herein, highlights motion of the surgical instruments 1008 over time, across surgical phases. Such information can drive further surgical skill analysis for novice and expert surgeons.
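
By way of a hypothetical example, the short sketch below relates a tracked tool-tip motion profile to a fixed anatomical landmark, such as the center of the pupil, by computing a per-frame distance; the input format (a list of (x, y) points, one per frame) is an assumption, not from the disclosure.

```python
# Hypothetical per-frame distance between the tool tip and an anatomical landmark.
import math

def distance_profile(tool_tip_track, landmark_xy):
    """Return the per-frame distance of the tool tip from a fixed landmark position."""
    lx, ly = landmark_xy
    return [
        None if point is None else math.hypot(point[0] - lx, point[1] - ly)
        for point in tool_tip_track  # None marks frames in which the tip was not detected
    ]
```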

[0133] In yet more examples, the motion profiles 1102 can be used to provide real-time guidance during surgery. For example, if the motion of the surgical instrument 1008 is beyond one or more thresholds, a warning can be provided to the operator. For example, a predetermined path of the surgical instrument 1008 can be configured for a surgical procedure. Alternatively, or in addition, the path for a surgical instrument 1008 can be configured per phase. If, during the surgical procedure, the surgical instrument 1008 is detected to veer off the predetermined path by more than a predetermined threshold, user feedback can be provided. Alternatively, or in addition, user feedback can also be provided to indicate that the predetermined path is being followed, for instance, based on detecting that the surgical instrument 1008 is following the predetermined path within a predetermined threshold.
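
The following minimal sketch illustrates one way such a guidance check could be implemented, assuming the predetermined path is represented as a list of reference (x, y) points and the threshold is expressed in pixels; both representations are illustrative assumptions.

```python
# Hypothetical path-deviation check for real-time guidance.
import math

def deviation_from_path(tool_xy, path_points):
    """Distance from the current tool position to the nearest point on the predetermined path."""
    x, y = tool_xy
    return min(math.hypot(x - px, y - py) for px, py in path_points)

def check_guidance(tool_xy, path_points, threshold_px):
    """Return 'warning' when the instrument veers off the path; otherwise confirm it is on path."""
    if deviation_from_path(tool_xy, path_points) > threshold_px:
        return "warning"
    return "on_path"
```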

[0134] The technical solutions described herein can use endoscopic video of the surgical procedure in one or more examples. In other examples, the technical solutions described herein use open surgery video, for example, with cameras 1005 mounted on the surgeon’s head. Cameras 1005 can be mounted at various other locations around the operating room in other examples.

[0135] The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer-readable storage medium (or media) having computer-readable program instructions thereon for causing a processor to carry out aspects of the present invention.

[0136] The computer-readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer-readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer-readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer-readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

[0137] Computer-readable program instructions described herein can be downloaded to respective computing/processing devices from a computer-readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network, and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium within the respective computing/processing device.

[0138] Computer-readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some aspects, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer-readable program instructions by utilizing state information of the computer-readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

[0139] Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to aspects of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.

[0140] These computer-readable program instructions may be provided to a processor of a general-purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

[0141] The computer-readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer-implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

[0142] The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various aspects of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

[0143] The descriptions of the various aspects of the present invention have been presented for purposes of illustration but are not intended to be exhaustive or limited to the aspects disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described aspects. The terminology used herein was chosen to best explain the principles of the aspects, the practical application or technical improvement over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the aspects described herein.

[0144] Various aspects of the invention are described herein with reference to the related drawings. Alternative aspects of the invention can be devised without departing from the scope of this invention. Various connections and positional relationships (e.g., over, below, adjacent, etc.) are set forth between elements in the following description and in the drawings. These connections and/or positional relationships, unless specified otherwise, can be direct or indirect, and the present invention is not intended to be limiting in this respect. Accordingly, a coupling of entities can refer to either a direct or an indirect coupling, and a positional relationship between entities can be a direct or indirect positional relationship. Moreover, the various tasks and process steps described herein can be incorporated into a more comprehensive procedure or process having additional steps or functionality not described in detail herein.

[0145] The following definitions and abbreviations are to be used for the interpretation of the claims and the specification. As used herein, the terms “comprises,” “comprising,” “includes,” “including,” “has,” “having,” “contains,” or “containing,” or any other variation thereof are intended to cover a non-exclusive inclusion. For example, a composition, a mixture, process, method, article, or apparatus that comprises a list of elements is not necessarily limited to only those elements but can include other elements not expressly listed or inherent to such composition, mixture, process, method, article, or apparatus.

[0146] Additionally, the term “exemplary” is used herein to mean “serving as an example, instance or illustration.” Any aspect or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other aspects or designs. The terms “at least one” and “one or more” may be understood to include any integer number greater than or equal to one, i.e., one, two, three, four, etc. The term “a plurality” may be understood to include any integer number greater than or equal to two, i.e., two, three, four, five, etc. The term “connection” may include both an indirect “connection” and a direct “connection.”

[0147] The terms “about,” “substantially,” “approximately,” and variations thereof are intended to include the degree of error associated with measurement of the particular quantity based upon the equipment available at the time of filing the application. For example, “about” can include a range of ± 8% or 5%, or 2% of a given value.

[0148] For the sake of brevity, conventional techniques related to making and using aspects of the invention may or may not be described in detail herein. In particular, various aspects of computing systems and specific computer programs to implement the various technical features described herein are well known. Accordingly, in the interest of brevity, many conventional implementation details are only mentioned briefly herein or are omitted entirely without providing the well-known system and/or process details.

[0149] It should be understood that various aspects disclosed herein may be combined in different combinations than the combinations specifically presented in the description and accompanying drawings. It should also be understood that, depending on the example, certain acts or events of any of the processes or methods described herein may be performed in a different sequence, may be added, merged, or left out altogether (e.g., all described acts or events may not be necessary to carry out the techniques). In addition, while certain aspects of this disclosure are described as being performed by a single module or unit for purposes of clarity, it should be understood that the techniques of this disclosure may be performed by a combination of units or modules associated with, for example, a medical device.

[0150] In one or more examples, the described techniques may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored as one or more instructions or code on a computer-readable medium and executed by a hardware-based processing unit. Computer-readable media may include non-transitory computer-readable media, which corresponds to a tangible medium such as data storage media (e.g., RAM, ROM, EEPROM, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer).

[0151] Instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term “processor” as used herein may refer to any of the foregoing structure or any other physical structure suitable for implementation of the described techniques. Also, the techniques could be fully implemented in one or more circuits or logic elements.