
Title:
SYSTEM AND METHOD FOR RECOGNIZING ACTIVITIES
Document Type and Number:
WIPO Patent Application WO/2019/152489
Kind Code:
A1
Abstract:
A surveillance system includes a camera positioned to capture video images of an area being surveilled, a computer including an activity detector trained to detect primitive activities and attributes, and an index including a plurality of video clips and a textual description associated with each video clip. The activity detector is operable to generate the plurality of video clips and the textual description by separating video clips including a primitive activity from the captured video images. The associated textual description describes one of a single primitive activity and an attribute of the video clip. A search engine within the computer is operable to search the plurality of video clips based on a textual input that includes at least one non-primitive activity.

Inventors:
KARANAM, Srikrishna (US)
PENG, Kuan-Chuan (US)
WU, Ziyan (US)
CHANG, Ti-Chiun (US)
ERNST, Jan (US)
Application Number:
PCT/US2019/015806
Publication Date:
August 08, 2019
Filing Date:
January 30, 2019
Assignee:
SIEMENS AG (DE)
SIEMENS CORP (US)
International Classes:
G06K 9/00
Other References:
ZHENG XU ET AL: "Semantic based representing and organizing surveillance big data using video structural description technology", JOURNAL OF SYSTEMS & SOFTWARE, vol. 102, 1 April 2015 (2015-04-01), US, pages 217 - 225, XP055575712, ISSN: 0164-1212, DOI: 10.1016/j.jss.2014.07.024
Attorney, Agent or Firm:
OTTERLEE, Thomas J. (US)
Claims:
CLAIMS

What is claimed is:

1. A surveillance system comprising:

a camera positioned to capture video images of an area being surveilled;

a computer including an activity detector trained to detect primitive activities and attributes;

an index including a plurality of video clips and a textual description associated with each video clip, the activity detector operable to generate the plurality of video clips and the textual description by separating video clips including a primitive activity from the captured video images, wherein the associated textual description describes one of a single primitive activity and an attribute of the video clip; and

a search engine within the computer, the search engine operable to search the plurality of video clips based on a textual input that includes at least one non-primitive activity.

2. The surveillance system of claim 1, further comprising an input device connected to the computer and operable to facilitate the input of the textual input.

3. The surveillance system of claim 1, wherein each video clip includes at least one primitive activity.

4. The surveillance system of claim 1, wherein the textual input includes at least one non-primitive activity and at least one attribute to be searched.

5. The surveillance system of claim 4, wherein the computer breaks each of the non-primitive activities into a search string of primitive activities separated by logical operators.

6. A method of surveilling an area, the method comprising:

positioning a video camera to capture a video image of the area;

training an activity detector to detect primitive activities using data containing examples of those primitive activities;

separating the video image into a plurality of video clips using the activity detector, each clip including at least one primitive activity;

adding a textual description to each of the plurality of video clips, the combination of the video clips and the textual description defining an index;

providing a textual input including a non-primitive activity;

breaking the textual input into a plurality of searched primitive activities; and

searching the index for a video clip including each of the searched primitive activities.

7. The method of claim 6, wherein each textual description includes a single primitive activity.

8. The method of claim 7, wherein each textual description includes the single primitive activity and an attribute.

9. The method of claim 6, wherein each textual description includes an attribute.

10. The method of claim 9, wherein the attribute includes a time stamp indicative of the time span covered by the video clip.

11. The method of claim 6, wherein a first video clip includes a plurality of primitive activities and a plurality of attributes, and wherein each primitive activity and each attribute is a separate textual description within the index.

12. A method of surveilling an area, the method comprising:

positioning a video camera to capture a video image of the area;

using an activity detector to separate the video image into a plurality of video clips each including at least one primitive activity;

assigning a textual description to each video clip, the textual description indicative of one of the at least one primitive activities;

assigning attributes to each of the video clips;

storing the video clips, the associated attributes, and the associated textual descriptions in an index;

providing a textual search string including at least one non-primitive activity and at least one attribute;

separating the textual search string into primitive activities and attributes separated by logical operators to define a separated search string; and

searching the index using the separated search string to find video clips that match the textual search string.

13. The method of claim 12, wherein the attribute includes a time stamp indicative of the time span covered by the video clip.

14. The method of claim 12, wherein a first video clip includes a plurality of primitive activities and a plurality of attributes, and wherein each primitive activity and each attribute is a separate textual description within the index.

15. The method of claim 12, wherein the primitive activity includes a person running.

Description:
SYSTEM AND METHOD FOR RECOGNIZING ACTIVITIES

TECHNICAL FIELD

[0001] The present disclosure is directed, in general, to a system and method for detecting particular activities, and more specifically to such a system and method that detects and applies a textual description to activities that have not been seen before.

BACKGROUND

[0002] Recognizing suspicious activities is a critical aspect of video surveillance systems. Most modern activity recognition algorithms are trained with a known set of activities and consequently are capable of only recognizing activities that belong to this set.

SUMMARY

[0003] A surveillance system includes a camera positioned to capture video images of an area being surveilled, a computer including an activity detector trained to detect primitive activities and attributes, and an index including a plurality of video clips and a textual description associated with each video clip. The activity detector is operable to generate the plurality of video clips and the textual description by separating video clips including a primitive activity from the captured video images. The associated textual description describes one of a single primitive activity and an attribute of the video clip. A search engine within the computer is operable to search the plurality of video clips based on a textual input that includes at least one non-primitive activity.

[0004] In another construction, a method of surveilling an area includes positioning a video camera to capture a video image of the area, training an activity detector to detect primitive activities using data containing examples of those primitive activities, and separating the video image into a plurality of video clips using the activity detector, each clip including at least one primitive activity. The method further includes adding a textual description to each of the plurality of video clips, the combination of the video clips and the textual description defining an index, providing a textual input including a non-primitive activity, breaking the textual input into a plurality of searched primitive activities, and searching the index for a video clip including each of the searched primitive activities.

[0005] In another construction, a method of surveilling an area includes positioning a video camera to capture a video image of the area, using an activity detector to separate the video image into a plurality of video clips each including at least one primitive activity, assigning a textual description to each video clip, the textual description indicative of one of the at least one primitive activities, and assigning attributes to each of the video clips. The method also includes storing the video clips, the associated attributes, and the associated textual descriptions in an index, providing a textual search string including at least one non-primitive activity and at least one attribute, separating the textual search string into primitive activities and attributes separated by logical operators to define a separated search string, and searching the index using the separated search string to find video clips that match the textual search string.

[0006] The foregoing has outlined rather broadly the technical features of the present disclosure so that those skilled in the art may better understand the detailed description that follows. Additional features and advantages of the disclosure will be described hereinafter that form the subject of the claims. Those skilled in the art will appreciate that they may readily use the conception and the specific embodiments disclosed as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. Those skilled in the art will also realize that such equivalent constructions do not depart from the spirit and scope of the disclosure in its broadest form.

[0007] Also, before undertaking the Detailed Description below, it should be understood that various definitions for certain words and phrases are provided throughout this specification and those of ordinary skill in the art will understand that such definitions apply in many, if not most, instances to prior as well as future uses of such defined words and phrases. While some terms may include a wide variety of embodiments, the appended claims may expressly limit these terms to specific embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

[0008] Fig. 1 is a schematic illustration of a video surveillance system.

[0009] Fig. 2 is a schematic illustration of the operation of a server or computer of the video surveillance system of Fig. 1.

[0010] Fig. 3 is a schematic illustration of a training process or a deep learning process for a neural network or other AI system.

[0011] Fig. 4 is a schematic illustration of a search operation using the video surveillance system of Fig. 1.

[0012] Before any embodiments of the invention are explained in detail, it is to be understood that the invention is not limited in its application to the details of construction and the arrangement of components set forth in the following description or illustrated in the following drawings. The invention is capable of other embodiments and of being practiced or of being carried out in various ways. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting.

DETAILED DESCRIPTION

[0013] Various technologies that pertain to systems and methods will now be described with reference to the drawings, where like reference numerals represent like elements throughout. The drawings discussed below, and the various embodiments used to describe the principles of the present disclosure in this patent document, are by way of illustration only and should not be construed in any way to limit the scope of the disclosure. Those skilled in the art will understand that the principles of the present disclosure may be implemented in any suitably arranged apparatus. It is to be understood that functionality that is described as being carried out by certain system elements may be performed by multiple elements. Similarly, for instance, an element may be configured to perform functionality that is described as being carried out by multiple elements. The numerous innovative teachings of the present application will be described with reference to exemplary non-limiting embodiments.

[0014] Also, it should be understood that the words or phrases used herein should be construed broadly, unless expressly limited in some examples. For example, the terms "including," "having," and "comprising," as well as derivatives thereof, mean inclusion without limitation. The singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Further, the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. The term "or" is inclusive, meaning and/or, unless the context clearly indicates otherwise. The phrases "associated with" and "associated therewith," as well as derivatives thereof, may mean to include, be included within, interconnect with, contain, be contained within, connect to or with, couple to or with, be communicable with, cooperate with, interleave, juxtapose, be proximate to, be bound to or with, have, have a property of, or the like.

[0015] Also, although the terms "first", "second", "third" and so forth may be used herein to refer to various elements, information, functions, or acts, these elements, information, functions, or acts should not be limited by these terms. Rather these numeral adjectives are used to distinguish different elements, information, functions or acts from each other. For example, a first element, information, function, or act could be termed a second element, information, function, or act, and, similarly, a second element, information, function, or act could be termed a first element, information, function, or act, without departing from the scope of the present disclosure.

[0016] In addition, the term "adjacent to" may mean: that an element is relatively near to but not in contact with a further element; or that the element is in contact with the further element, unless the context clearly indicates otherwise. Further, the phrase "based on" is intended to mean "based, at least in part, on" unless explicitly stated otherwise. Terms "about" or "substantially" or like terms are intended to cover variations in a value that are within normal industry manufacturing tolerances for that dimension. If no industry standard is available, a variation of twenty percent would fall within the meaning of these terms unless otherwise stated.

[0017] The software aspects of the present invention could be stored on virtually any computer readable medium including a local disk drive system, a remote server, the internet, or a cloud-based storage location. In addition, aspects could be stored on portable devices or memory devices as may be required. The computer generally includes an input/output device that allows for access to the software regardless of where it is stored, one or more processors, memory devices, user input devices, and output devices such as monitors, printers, and the like.

[0018] The processor could include a standard micro-processor or could include artificial intelligence accelerators or processors that are specifically designed to perform artificial intelligence applications such as artificial neural networks, machine vision, and machine learning. Typical applications include algorithms for robotics, internet of things, and other data-intensive or sensor-driven tasks. Often AI accelerators are multi-core designs and generally focus on low-precision arithmetic, novel dataflow architectures, or in-memory computing capability. In still other applications, the processor may include a graphics processing unit (GPU) designed for the manipulation of images and the calculation of local image properties. The mathematical bases of neural networks and image manipulation are similar, leading GPUs to become increasingly used for machine learning tasks. Of course, other processors or arrangements could be employed if desired. Other options include but are not limited to field-programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), and the like.

[0019] The computer also includes communication devices that may allow for communication with other computers or computer networks, as well as with other devices such as machine tools, work stations, actuators, controllers, sensors, and the like.

[0020] Fig. 1 schematically illustrates a video surveillance system 10 such as the type that might be employed in an office building, a bus terminal, an airport, or other location in which people move about. The surveillance system 10 includes a plurality of imaging devices 15 arranged throughout the facility. Imaging devices 15 typically include digital video cameras 20 but could also include analog video cameras, still cameras, infrared cameras and the like. In addition, sensors such as motion sensors or light sensors could be employed to control the imaging devices 15 to reduce the quantity of video captured without any activity of interest. In still other arrangements, timers may be used to control the operation of the imaging devices 15.

[0021] Each of the devices 15 (camera, sensor, etc.) is arranged to transmit captured data to a computer such as a central server 25 or a cloud-based system for storage and retrieval. In the case of video cameras 20, the video data is transmitted to the central server 25 using a wired or wireless connection. As should be clear, the method of transmitting the captured data as well as the arrangement of the server 25 are largely irrelevant to the invention. Virtually any transmission arrangement or server 25 could be employed as best fits the particular application. As further illustrated in Fig. 1, the central server 25 facilitates the attachment of multiple external computer devices 30 using any common attachment scheme (including wired or wireless). This allows multiple users to conveniently access the collected data using any convenient system.

[0022] Video surveillance systems 10 must be capable of recognizing new, possibly complex and previously unspecified activities in order to be practical and useful. As will be discussed in greater detail, an activity recognition algorithm can be trained to detect activities belonging to a pre-defined set of activities, but in the real-world, the algorithm should be able to detect activities beyond this training set.

[0023] The server 25, or another computer, receives captured video data 35 from the video cameras 20 and processes that data 35 to make it useful. Fig. 2 schematically illustrates the high-level operation of the server 25 in handling both the raw video data 35 and a query 40. The server 25 includes a trained AI system or model 45 that may include, for example, a neural network 85, an index 50, and a query handler 55.

[0024] As will be discussed in greater detail, the AI model 45 receives the raw video data 35 and breaks that video data 35 into a plurality of video clips 60 with each clip 60 including a textual description 65 of the content of the clip 60. The textual description 65 generally aligns with or conforms with known primitive activities (e.g., walking, running, sitting, standing, etc.) that are predefined or learned. The video clips 60 and the associated textual descriptions 65 are then passed to the index 50 for storage. The index 50 is essentially a searchable database that efficiently stores the video clips 60, textual descriptions 65, and other data that may be desirable.
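The patent leaves the realization of the index 50 open. As a minimal illustrative sketch, one in-memory realization is an inverted index mapping each textual description 65 to the clips 60 that carry it; all class, field, and method names below are assumptions introduced here, not taken from the patent:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class VideoClip:
    clip_id: int
    source_path: str          # where the stored clip resides
    descriptions: list[str]   # one primitive activity or attribute per entry

class ClipIndex:
    """Inverted index mapping each textual description to matching clip IDs."""
    def __init__(self) -> None:
        self.clips: dict[int, VideoClip] = {}
        self.by_description: dict[str, set[int]] = defaultdict(set)

    def add(self, clip: VideoClip) -> None:
        self.clips[clip.clip_id] = clip
        for desc in clip.descriptions:
            self.by_description[desc.lower()].add(clip.clip_id)

    def lookup(self, description: str) -> set[int]:
        """Return the IDs of all clips indexed under the given description."""
        return self.by_description.get(description.lower(), set())
```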

[0025] The query handler 55 receives queries 40 input directly to the server 25 by a user or from remote computers 30. The query 40 is preferably in the form of a natural language statement that includes a number of primitive activities and known characteristics. The primitive activities and known characteristics are separated by the query handler 55 into individual primitive activities or characteristics 70, each separated by Boolean operators 75. In other words, the query handler 55 generates a new search criterion 80 that is easily searched in the index 50.

[0026] Thus, as will be discussed in greater detail, the video surveillance system 10 of Fig. 1 provides a system and method to recognize new, ad-hoc, previously unspecified activities by exploiting the compositional and hierarchical nature of natural language. Given a query 40 in natural language, the system 10 formulates an action model or new search criterion 80 with primitive activities 70 connected by simple Boolean logic 75. This allows ad-hoc events to be created on-the-fly without any need for retraining the AI model 45 or manually modeling activities. The disclosed method can be used in conjunction with existing activity recognition algorithms that are trained to produce natural language descriptions of videos.
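A full query handler would rely on a natural language model to map the query onto the trained primitive vocabulary. As a purely illustrative sketch of the decomposition step, a toy keyword matcher might look as follows, where the vocabulary and the `decompose_query` helper are hypothetical names:

```python
# Illustrative only: a production query handler would use an NLP model rather
# than substring matching to map free-form queries onto known primitives.
KNOWN_PRIMITIVES = ["walking", "running", "sitting", "standing", "talking"]

def decompose_query(query: str) -> list[tuple[str, str]]:
    """Break a natural language query into (operator, primitive) pairs,
    here joined uniformly with AND as in the search criterion 80."""
    query = query.lower()
    return [("AND", f"person {p}") for p in KNOWN_PRIMITIVES if p in query]
```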

[0027] Fig. 3 illustrates the steps involved in the creation of the trained AI model 45 and its use. The structure of the neural network 85 is first defined and created. Once created, training data 90 in the form of a series of video images or video clips showing primitive activities and characteristics 70 is provided to the neural network 85 to train the neural network 85 to recognize these activities and characteristics 70 in other contexts. The trained AI model 45 is then deployed for use in the video surveillance system 10 such as might be found in a shopping mall, airport, bus terminal, public park, and the like.

[0028] As noted above, the system 10 includes the trained AI model 45 which is derived from an untrained AI model 95. In order to train the AI model 95, training data 90 is first provided to the neural network or other AI arrangement. The training data 90 includes data contextual or relevant to the task required of the AI system 45. For example, if the AI system is intended to identify different types of animals, raw data of known animals would be provided to the system. The present system is intended to identify primitive activities or characteristics 70 of people or groups of people taken from video surveillance data 35. As such, the training data 90 includes images or video of people or groups of people performing primitive activities 70 (e.g., walking, running, talking, standing, sitting, etc.) or having certain characteristics 70 (e.g., hair color, skin color, height, weight, wearing objects, carrying objects, etc.).
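The patent does not prescribe a training framework. As one hedged illustration, a training step for such a detector could resemble the following PyTorch-style sketch, assuming precomputed 512-dimensional per-clip feature vectors and one class per primitive activity; every identifier here is an assumption for illustration:

```python
import torch
import torch.nn as nn

# Hypothetical primitive-activity vocabulary and classifier head.
ACTIVITIES = ["walking", "running", "talking", "standing", "sitting"]
model = nn.Sequential(nn.Linear(512, 128), nn.ReLU(),
                      nn.Linear(128, len(ACTIVITIES)))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient step; features are (batch, 512), labels are class indices."""
    optimizer.zero_grad()
    loss = loss_fn(model(features), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```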

[0029] Once the untrained AI model 95 is trained, the trained model 45 is ready for deployment in the video surveillance system 10. The trained model 45 in the present application is operable to receive the video data 35 and separate that video data 35 into the various video clips 60 with each clip 60 having an identifiable primitive activity or characteristic 70. Additional feedback steps 100 can be provided and preferably include an evaluation process 105 in which the decisions made by the AI model 45 are evaluated to determine whether it needs to be adjusted, retrained, or improved to enhance the accuracy of its decisions.
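The evaluation process 105 is likewise left open in the patent; one simple possibility is a held-out accuracy check, reusing the hypothetical classifier sketched above, whose result could trigger the retraining feedback step:

```python
def evaluate_model(model: nn.Module, features: torch.Tensor,
                   labels: torch.Tensor) -> float:
    """Fraction of held-out clips classified correctly; a low value would
    trigger the adjust/retrain feedback step 100."""
    with torch.no_grad():
        preds = model(features).argmax(dim=1)
    return (preds == labels).float().mean().item()
```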

[0030] The video images 35 that are captured can be stored without processing or can be immediately processed prior to storage. If stored without processing, the images 35 would need to be processed before they could be searched in response to some need. Preferred embodiments process the video images 35 as discussed with regard to Fig. 2 prior to storage.

[0031] With reference to Figs. 1 and 2, the video images 35 are provided to the trained model 45, which evaluates the video images 35 to locate primitive activities (e.g., walking, running, talking, standing, sitting, etc.) or characteristics 70 (e.g., hair color, skin color, height, weight, wearing objects, carrying objects, etc.). The model 45 then breaks the video images 35 into distinct video clips 60 with a textual description 65 of each clip 60. In some embodiments, a one-to-one relationship is maintained between the video clip 60 and the textual description 65. In these embodiments, each textual description 65 describes only a single primitive activity or characteristic 70. Because each clip 60 is related to a single textual description 65, multiple copies of the clip 60 may be stored in the index 50 to cover the multiple primitive activities or characteristics 70 that might be contained in the video clip 60. In other constructions, a one-to-many relationship exists between the video clips 60 and the textual descriptions 65. In these constructions, each video clip 60 may have multiple separate textual descriptions 65 stored in the index 50, with each description 65 describing a single primitive activity or characteristic 70. In preferred constructions, one characteristic 70 of each video clip 60 is the date and time at which it was captured. Another characteristic 70 may include the location of the imaging device 15 that captured the video clip 60. The video clips 60 and the textual descriptions 65 are stored in, or pointed to by, the index 50 that is stored on the central server 25 or elsewhere as may be desired to facilitate searching.
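The one-to-many construction maps naturally onto a relational schema with one row per clip and one row per description. The following sketch shows one hypothetical layout; the table and column names are invented for illustration and do not come from the patent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE clip (
        clip_id     INTEGER PRIMARY KEY,
        captured_at TEXT,   -- date/time characteristic
        camera_loc  TEXT    -- location of the imaging device
    );
    CREATE TABLE description (
        clip_id INTEGER REFERENCES clip(clip_id),
        text    TEXT        -- exactly one primitive activity or characteristic
    );
""")
conn.execute("INSERT INTO clip VALUES (1, '2019-01-30T09:15', 'terminal B')")
conn.executemany("INSERT INTO description VALUES (1, ?)",
                 [("person walking",), ("person carrying a bag",)])
```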

[0032] In order to search the video clips 60, a user provides a natural language description or query 40 of what is typically a previously unseen activity. The query handler 55 breaks the natural language query 40 into multiple primitive activities and/or characteristics 70. The primitive activities and characteristics 70 roughly correspond to the textual descriptions 65 generated above for the various video clips 60.

[0033] Fig. 4 illustrates a possible searching scenario in which a user wants to search for people who are performing certain complex activities. In this example, a user wishing to search for a previously unseen activity provides the natural language search string or query 40 describing that activity to the computer 30. For example, the user may wish to search for "three people meeting and walking in an airport terminal, one of the people is not wearing gloves while the other two are wearing black gloves on their left hands." The query handler 55 breaks this natural language query 40 into a series of individual primitive activities or characteristics 70 or textual strings, with each textual string 70 corresponding to the descriptions provided to the index 50 and indicative of the primitive activities and characteristics 70, separated by logical operators 75 (e.g., and, or, not, nand, etc.). In this case, the computer 30 or server 25 might form the following new search criterion 80:

Three people interacting with each other (AND)

Three people walking in an airport terminal (AND)

One person doesn’t wear gloves (AND)

Two persons wearing gloves on their left hands (AND)

Two persons wearing black gloves (AND)

Two persons wearing gloves on their left hands (NOT)

Using the index 50, the computer 30 is able to apply the new search criterion 80 and identify any video clips 60 that meet the search criterion 80. The computer 30 can then provide the video clips 60 to the user for immediate display or provide a list for the user to review. The user could also put time limits on the video clips 60 to ensure that relevant clips 60 are uncovered only from the selected time window. Similarly, particular locations, such as an airport terminal or a specific location in the airport terminal, can be included in the new search criterion 80 to further narrow the search.
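With clips indexed as sketched earlier, evaluating a criterion of this form reduces to set algebra over clip IDs: intersection for AND, union for OR, and difference for NOT. A hedged sketch, reusing the hypothetical ClipIndex introduced above:

```python
def evaluate(criterion: list[tuple[str, str]], index: "ClipIndex") -> set[int]:
    """Evaluate (operator, description) pairs left to right; the first entry
    seeds the result set, later entries combine via set algebra."""
    result = None
    for op, desc in criterion:
        ids = index.lookup(desc)
        if result is None:
            result = set(ids)
        elif op == "AND":
            result &= ids
        elif op == "OR":
            result |= ids
        elif op == "NOT":
            result -= ids
    return result if result is not None else set()
```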

[0034] The video surveillance system 10 operates to capture video images from various locations throughout the facility and delivers them to the central server 25. The central server 25 uses the trained AI model 45 to create the index 50 of video clips 60 and textual descriptions 65. Specifically, the trained AI model 45 recognizes primitive activities and characteristics 70 within the video images 35 and separates the video image 35 into individual video clips 60 based on those primitive activities and characteristics 70. Each textual description 65 describes only one activity or characteristic 70. In addition, characteristics 70 such as a time stamp or a location stamp may be added to each video clip 60 as desired.

[0035] If a user wishes to search for a particular activity or characteristic 70, the user provides a natural language query 40 to the computer 30 or server 25. The query handler 55 then breaks the natural language query 40 into a series of textual strings that match known primitive activities and characteristics 70. The textual strings are separated by Boolean or logical operators 75 to define the new search criterion 80 and facilitate searching by the computer 30 or the server 25. The computer 30 or server 25 then searches the index 50 for video clips 60 that match the search criterion 80 and provides those video clips 60 to the user to complete the search.
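Putting the hypothetical pieces above together, an end-to-end search might read as follows; again, every identifier is an illustrative assumption rather than the patent's own naming:

```python
index = ClipIndex()
index.add(VideoClip(1, "clips/0001.mp4",
                    ["three people walking", "person wearing black gloves"]))
index.add(VideoClip(2, "clips/0002.mp4", ["person running"]))

criterion = [("AND", "three people walking"),
             ("AND", "person wearing black gloves")]
for clip_id in evaluate(criterion, index):    # matches -> {1}
    print(index.clips[clip_id].source_path)   # clips/0001.mp4
```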

[0036] The surveillance system 10 described herein advances the art of video surveillance by allowing for the rapid and efficient searching of large volumes of video clips 60 using natural language queries 40 to find the relevant video clips 60. The system 10 is also capable of finding video clips 60 including activities for which the model 45 was not trained by breaking those activities into simpler primitive activities 70 that are known and trained. The ability to search for these ad hoc activities greatly improves the accuracy and utility of searching these video images.

[0037] Although an exemplary embodiment of the present disclosure has been described in detail, those skilled in the art will understand that various changes, substitutions, variations, and improvements disclosed herein may be made without departing from the spirit and scope of the disclosure in its broadest form.

[0038] None of the description in the present application should be read as implying that any particular element, step, act, or function is an essential element, which must be included in the claim scope: the scope of patented subject matter is defined only by the allowed claims. Moreover, none of these claims are intended to invoke a means plus function claim construction unless the exact words "means for" are followed by a participle.