

Title:
GENERIC ACTION START DETECTION
Document Type and Number:
WIPO Patent Application WO/2022/061319
Kind Code:
A1
Abstract:
A system may include a feature extraction network and a generic actions start detection network that further includes actionness logic, positioning logic, a processor, and a non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to segment a video file into one or more input clips, and generate a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips. The instructions may further be executed by the processor to determine an actionness probability and an action start probability of the first input clip, determine a confidence score based on the actionness probability and the action start probability, and predict an action start index of the video data based on the confidence score.

Inventors:
ZHANG YUEXI (US)
CHEN MING (US)
HO CHIU MAN (US)
Application Number:
PCT/US2021/059002
Publication Date:
March 24, 2022
Filing Date:
November 11, 2021
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06N20/00; H04N5/232
Domestic Patent References:
WO2021077141A2 (2021-04-22)
Foreign References:
US20200302178A1 (2020-09-24)
Other References:
SHOU ZHENG ET AL: "Online Action Detection in Untrimmed, Streaming Videos - Modeling and Evaluation", 19 February 2018 (2018-02-19), pages 1 - 10, XP055937212, [retrieved on 20220630]
SPIEGL: "Contrastive Unpaired Translation using Focal Loss for Patch Classification", ARXIV, 2021, XP091059073, Retrieved from the Internet [retrieved on 20220109]
GUO ET AL.: "Effective Parallel Corpus Mining using Bilingual Sentence", ARXIV PREPRINT, XP081262361, Retrieved from the Internet [retrieved on 20220109]
Attorney, Agent or Firm:
BRATSCHUN, Thomas, D. et al. (US)
Claims:
WHAT IS CLAIMED IS:

1. A method comprising: segmenting, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames; generating, via a feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips; determining, via an actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start; determining, via a positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start; determining a confidence score based on the actionness probability and the action start probability; and predicting, via a generic action start detection network, an action start index of the video data based on the confidence score.

2. The method of claim 1, further comprising: triggering, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score.

3. The method of claim 2, further comprising: determining, via the generic action start detection network, an end point, wherein at the end point, the camera application stops recording video data, wherein the end point is determined to be a point where the confidence score falls below the threshold confidence score.

4. The method of claim 1, wherein determining the actionness probability further comprises: determining, via the actionness logic, a total number of action frames in the one or more input clips; determining, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips; and assigning, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.

5. The method of claim 1, wherein predicting the action start index further comprises: determining, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability; determining a relative action start index corresponding to a frame of the first input clip with the highest confidence score; and transforming the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.

6. The method of claim 1, further comprising: applying, via the generic action start detection network, one or more attention mechanisms to the actionness probability, wherein an output vector is produced by the one or more attention mechanisms; and wherein predicting the action start index further comprises combining the output vector of the one or more attention mechanisms with the action start probability.

7. The method of claim 1, further comprising: determining, via the generic action start detection network, a focal loss of the actionness logic and position logic; and optimizing focal loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

8. The method of claim 1, further comprising: determining, via the generic action start detection network, a contrastive loss between all pairs of frames of the input clip; and optimizing the contrastive loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

9. An apparatus, comprising: a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: segment, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames; generate, via a feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips; determine, via an actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start; determine, via a positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start; determine a confidence score based on the actionness probability and the action start probability; and predict, via a generic action start detection network, an action start index of the video data based on the confidence score.

10. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: trigger, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score.

11. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, an end point, wherein at the end point, the camera application stops recording video data, wherein the end point is determined to be a point where the confidence score falls below the threshold confidence score.

12. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the actionness logic, a total number of action frames in the one or more input clips; determine, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips; and assign, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.

13. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability; determine a relative action start index corresponding to a frame of the first input clip with the highest confidence score; and transform the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.

14. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: apply, via the generic action start detection network, one or more attention mechanisms to the actionness probability, wherein an output vector is produced by the one or more attention mechanisms; and wherein predicting the action start index further comprises combining the output vector of the one or more attention mechanisms with the action start probability.

15. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, a focal loss of the actionness logic and position logic; and optimize focal loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

16. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, a contrastive loss between all pairs of frames of the input clip; and optimize the contrastive loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

17. A system for generic action start detection, the system comprising: a feature extraction network; a generic action start detection network coupled to the feature extraction network, wherein the generic action start detection subsystem is a two-stream network, the generic action start detection network comprising: a first stream comprising actionness logic; a second stream comprising positioning logic; a processor; and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: segment, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames; generate, via the feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips; determine, via the actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start; determine, via the positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start; determine a confidence score based on the actionness probability and the action start probability; and predict an action start index of the video data based on the confidence score.

18. The system of claim 17, wherein the set of instructions is further executable by the processor to: trigger, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score.

19. The system of claim 17, wherein the set of instructions is further executable by the processor to: determine, via the actionness logic, a total number of action frames in the one or more input clips; determine, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips; and assign, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.

20. The system of claim 17, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability; determine a relative action start index corresponding to a frame of the first input clip with the highest confidence score; and transform the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.

21. The system of claim 17, wherein the set of instructions is further executable by the processor to: optimize for class imbalance, wherein optimizing for class imbalance further includes: determining, via the generic action start detection network, a focal loss of the actionness logic and position logic; and optimizing focal loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.


AMENDED CLAIMS received by the International Bureau on 08 March 2022 (08.03.2022)

WHAT IS CLAIMED IS:

1. A method comprising: segmenting, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames; generating, via a feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips; determining, via an actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start; determining, via a positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start; determining a confidence score based on the actionness probability and the action start probability; and predicting, via a generic action start detection network, an action start index of the video data based on the confidence score.

2. The method of claim 1, further comprising: triggering, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score.

3. The method of claim 2, further comprising: determining, via the generic action start detection network, an end point, wherein at the end point, the camera application stops recording video data, wherein the end point is determined to be a point where the confidence score falls below the threshold confidence score.

4. The method of claim 1, wherein determining the actionness probability further comprises: determining, via the actionness logic, a total number of action frames in the one or more input clips; determining, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips; and assigning, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.

5. The method of claim 1, wherein predicting the action start index further comprises: determining, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability; determining a relative action start index corresponding to a frame of the first input clip with the highest confidence score; and transforming the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.

6. The method of claim 1, further comprising: applying, via the generic action start detection network, one or more attention mechanisms to the actionness probability, wherein an output vector is produced by the one or more attention mechanisms; and wherein predicting the action start index further comprises combining the output vector of the one or more attention mechanisms with the action start probability.

7. The method of claim 1, further comprising: determining, via the generic action start detection network, a focal loss of the actionness logic and position logic; and optimizing focal loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

8. The method of claim 1, further comprising: determining, via the generic action start detection network, a contrastive loss between all pairs of frames of the input clip; and optimizing the contrastive loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

9. An apparatus, comprising: a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: segment, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames; generate, via a feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips; determine, via an actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start; determine, via a positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start; determine a confidence score based on the actionness probability and the action start probability; and predict, via a generic action start detection network, an action start index of the video data based on the confidence score.

10. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: trigger, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score.


11. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, an end point, wherein at the end point, the camera application stops recording video data, wherein the end point is determined to be a point where the confidence score falls below the threshold confidence score.

12. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the actionness logic, a total number of action frames in the one or more input clips; determine, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips; and assign, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.

13. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability; determine a relative action start index corresponding to a frame of the first input clip with the highest confidence score; and transform the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.

14. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: apply, via the generic action start detection network, one or more attention mechanisms to the actionness probability, wherein an output vector is produced by the one or more attention mechanisms; and wherein predicting the action start index further comprises combining the output vector of the one or more attention mechanisms with the action start probability.

15. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, a focal loss of the actionness logic and position logic; and optimize focal loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

16. The apparatus of claim 9, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, a contrastive loss between all pairs of frames of the input clip; and optimize the contrastive loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

17. A system for generic action start detection, the system comprising: a feature extraction network; a generic action start detection network coupled to the feature extraction network, wherein the generic action start detection subsystem is a two-stream network, the generic action start detection network comprising: a first stream comprising actionness logic; a second stream comprising positioning logic; a processor; and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: segment, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames; generate, via the feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips; determine, via the actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start; determine, via the positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start; determine a confidence score based on the actionness probability and the action start probability; and predict an action start index of the video data based on the confidence score.

18. The system of claim 17, wherein the set of instructions is further executable by the processor to: trigger, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score.

19. The system of claim 17, wherein the set of instructions is further executable by the processor to: determine, via the actionness logic, a total number of action frames in the one or more input clips; determine, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips; and assign, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.


20. The system of claim 17, wherein the set of instructions is further executable by the processor to: determine, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability; determine a relative action start index corresponding to a frame of the first input clip with the highest confidence score; and transform the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.


21. (CANCELED)


Description:
GENERIC ACTION START DETECTION

COPYRIGHT STATEMENT

[0001] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

FIELD

[0002] The present disclosure relates, in general, to methods, systems, and apparatuses for machine learning based automatic camera control.

BACKGROUND

[0003] Cameras are an integral part of modern mobile devices, and a frequently used tool. While cameras are readily accessible at our fingertips, spontaneous events, such as a chance encounter with a celebrity, or a baby's first steps, are often difficult to capture. Even when an event is expected to occur, without exact information of when the event will begin, there are inefficiencies and redundancies involved in capturing such events. For example, when capturing an event on video, periods of inactivity are also captured between significant actions. This leads to large files filled with unwanted footage, which in turn leads to inefficient use of storage space.

[0004] Thus, methods, systems, and apparatuses for generic action start detection are provided.

SUMMARY

[0005] Tools and techniques for generic action start detection are provided.

[0006] A method may include segmenting, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames, and generating, via a feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips. The method further includes determining, via an actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start, determining, via a positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start, and determining a confidence score based on the actionness probability and the action start probability. The method continues by predicting, via a generic action start detection network, an action start index of the video data based on the confidence score.

[0007] An apparatus may include a processor and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executed by the processor to segment, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames, and generate, via a feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips. The instructions may further be executed by the processor to determine, via an actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start, determine, via a positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start, and determine a confidence score based on the actionness probability and the action start probability. The instructions may further be executed by the processor to predict, via a generic action start detection network, an action start index of the video data based on the confidence score.

[0008] A system may include a feature extraction network, and a generic action start detection network coupled to the feature extraction network, wherein the generic action start detection subsystem is a two-stream network, the generic action start detection network including a first stream comprising actionness logic, and a second stream comprising positioning logic. The generic action start detection network may further include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to segment, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames, and generate, via the feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips. The instructions may further be executed by the processor to determine, via the actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start, determine, via the positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start, and determine a confidence score based on the actionness probability and the action start probability. The instructions may further be executed by the processor to predict an action start index of the video data based on the confidence score.

[0009] These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided therein.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sublabel is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.

[0011] Fig. 1A is a schematic block diagram of a system for generic action start detection, in accordance with various embodiments;

[0012] Fig. 1B is a schematic block diagram of an alternative system for generic action start detection, in accordance with various embodiments;

[0013] Fig. 2 is a schematic diagram of input clip labeling, in accordance with various embodiments;

[0014] Fig. 3 is a schematic diagram illustrating a graph of a confidence score over the frames of a video with corresponding example frames, in accordance with various embodiments;

[0015] Fig. 4 is a flow diagram of a method for generic action start detection, in accordance with various embodiments; and

[0016] Fig. 5 is a schematic block diagram of a computer system for generic action start detection, in accordance with various embodiments.

DETAILED DESCRIPTION OF EMBODIMENTS

[0017] Various embodiments provide tools and techniques for generic action start detection.

[0018] In some embodiments, a method for generic action start detection is provided. A method may include segmenting, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames, and generating, via a feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips. The method further includes determining, via an actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start, determining, via a positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start, and determining a confidence score based on the actionness probability and the action start probability. The method continues by predicting, via a generic action start detection network, an action start index of the video data based on the confidence score.

[0019] In some embodiments, the method may further include triggering, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score. In some examples, the method may further include determining, via the generic action start detection network, an end point, wherein at the end point, the camera application stops recording video data, wherein the end point is determined to be a point where the confidence score falls below the threshold confidence score.
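
For illustration only, the following Python sketch shows one way the thresholded start/stop behavior described above could be wired up; the camera, gasd, and clip_stream objects and their methods are hypothetical and not taken from this disclosure.

# Minimal sketch (not from the disclosure) of threshold-based start/stop control.
def control_recording(camera, gasd, clip_stream, threshold=0.5):
    recording = False
    for clip in clip_stream:                          # clips produced by the segmentation logic
        confidence, start_index = gasd.predict(clip)  # hypothetical GASD interface
        if not recording and confidence > threshold:
            camera.start_recording(at_frame=start_index)   # begin recording at the predicted action start index
            recording = True
        elif recording and confidence < threshold:
            camera.stop_recording()                   # end point: confidence fell below the threshold
            recording = False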

[0020] In some embodiments, determining the actionness probability may further include determining, via the actionness logic, a total number of action frames in the one or more input clips, determining, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips, and assigning, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.

[0021] In some embodiments, predicting the action start index may further include determining, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability, determining a relative action start index corresponding to a frame of the first input clip with the highest confidence score, and transforming the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.
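
For illustration, the following sketch gives one possible reading of the confidence computation described above, treating the actionness probability as a clip-level scalar that scales the per-frame action start vector, and converting the highest-scoring frame from a clip-relative index to an absolute index in the video file; the function and argument names are assumptions.

import numpy as np

def predict_action_start(actionness_prob, start_prob, clip_offset):
    # actionness_prob: clip-level scalar from the actionness logic.
    # start_prob: length-T vector from the positioning logic, one entry per frame of the clip.
    # clip_offset: index of the clip's first frame within the full video file.
    confidence = actionness_prob * np.asarray(start_prob)   # per-frame confidence scores
    relative_index = int(np.argmax(confidence))             # frame with the highest confidence in the clip
    absolute_index = clip_offset + relative_index           # location relative to the video file
    return float(confidence[relative_index]), absolute_index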

[0022] In further embodiments, the method may include applying, via the generic action start detection network, one or more attention mechanisms to the actionness probability, wherein an output vector is produced by the one or more attention mechanisms, and wherein predicting the action start index further comprises combining the output vector of the one or more attention mechanisms with the action start probability.
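
The attention mechanisms are not specified in detail; purely as an illustration, the sketch below applies a single softmax attention over per-frame actionness scores and fuses the resulting output vector with the action start probability by element-wise multiplication.

import torch
import torch.nn.functional as F

def combine_with_attention(actionness_scores, start_prob):
    # actionness_scores, start_prob: tensors of shape (T,), one value per frame.
    attn = F.softmax(actionness_scores, dim=-1)       # illustrative single attention over the clip's frames
    output_vector = attn * actionness_scores          # attended actionness representation
    return output_vector * start_prob                 # combine with the positioning stream's start probability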

[0023] In some embodiments, the method may include determining, via the generic action start detection network, a focal loss of the actionness logic and position logic, and optimizing focal loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic. In further embodiments, the method may include determining, via the generic action start detection network, a contrastive loss between all pairs of frames of the input clip, and optimizing the contrastive loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.
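
As an illustrative stand-in only (the disclosure does not give exact loss formulas), the sketch below pairs the standard binary focal loss with a simple contrastive loss over all pairs of frame embeddings in a clip.

import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # Standard binary focal loss; down-weights easy examples to address class imbalance.
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)               # probability assigned to the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

def pairwise_contrastive_loss(frame_feats, frame_labels, margin=1.0):
    # Contrastive loss over all pairs of frame embeddings in a clip: pulls frames with the
    # same label together and pushes frames with different labels at least `margin` apart.
    dists = torch.cdist(frame_feats, frame_feats)             # (T, T) pairwise distances
    same = (frame_labels[:, None] == frame_labels[None, :]).float()
    loss = same * dists.pow(2) + (1 - same) * F.relu(margin - dists).pow(2)
    return loss.mean()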

[0024] In some embodiments, an apparatus for generic action start detection is provided. The apparatus may include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executed by the processor to segment, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames, and generate, via a feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips. The instructions may further be executed by the processor to determine, via an actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start, determine, via a positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start, and determine a confidence score based on the actionness probability and the action start probability. The instructions may further be executed by the processor to predict, via a generic action start detection network, an action start index of the video data based on the confidence score.

[0025] In some embodiments, the set of instructions may further be executable by the processor to trigger, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score. In some embodiments, the set of instructions may further be executable by the processor to determine, via the generic action start detection network, an end point, wherein at the end point, the camera application stops recording video data, wherein the end point is determined to be a point where the confidence score falls below the threshold confidence score.

[0026] In some embodiments, the set of instructions may further be executable by the processor to determine, via the actionness logic, a total number of action frames in the one or more input clips, determine, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips, and assign, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.

[0027] In some embodiments, the set of instructions may further be executable by the processor to determine, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability, determine a relative action start index corresponding to a frame of the first input clip with the highest confidence score, and transform the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.

[0028] In some embodiments, the set of instructions may further be executable by the processor to apply, via the generic action start detection network, one or more attention mechanisms to the actionness probability, wherein an output vector is produced by the one or more attention mechanisms, and wherein predicting the action start index further comprises combining the output vector of the one or more attention mechanisms with the action start probability.

[0029] In some embodiments, the set of instructions may further be executable by the processor to determine, via the generic action start detection network, a focal loss of the actionness logic and position logic, and optimize focal loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic. In some embodiments, the set of instructions may further be executable by the processor to determine, via the generic action start detection network, a contrastive loss between all pairs of frames of the input clip, and optimize the contrastive loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

[0030] In further embodiments, a system for generic action start detection is provided. The system may include a feature extraction network, and a generic action start detection network coupled to the feature extraction network, wherein the generic action start detection subsystem is a two-stream network, the generic action start detection network including a first stream comprising actionness logic, and a second stream comprising positioning logic. The generic action start detection network may further include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to segment, via segmentation logic, a video file into one or more input clips, each input clip of the one or more input clips comprising a plurality of video frames, and generate, via the feature extraction network, a respective feature map of spatial temporal features for each frame of a first input clip of the one or more input clips. The instructions may further be executed by the processor to determine, via the actionness logic, an actionness probability of the first input clip, wherein the actionness probability is a probability that the input clip includes an action start, determine, via the positioning logic, an action start probability of the first input clip, wherein the action start probability is a vector, each field of the vector corresponding to a respective frame of the first input clip, wherein the action start probability is a probability that the respective frame of the first input clip is the action start, and determine a confidence score based on the actionness probability and the action start probability. The instructions may further be executed by the processor to predict an action start index of the video data based on the confidence score.

[0031] In some embodiments, the set of instructions may further be executable by the processor to trigger, via the generic action start detection network, a camera application to begin recording video data at the predicted action start index, in response to determining that the confidence score exceeds a threshold confidence score. In some embodiments, the set of instructions may further be executable by the processor to determine, via the generic action start detection network, an end point, wherein at the end point, the camera application stops recording video data, wherein the end point is determined to be a point where the confidence score falls below the threshold confidence score.

[0032] In some embodiments, the set of instructions may further be executable by the processor to determine, via the actionness logic, a total number of action frames in the one or more input clips, determine, via the actionness logic, an intersection over union of the number of action frames in the first input clip and the total number of action frames in the one or more input clips, and assign, via the actionness logic, an actionness probability score of 1 to the first input clip in response to determining that the intersection over union is above a threshold intersection over union.

[0033] In some embodiments, the set of instructions may further be executable by the processor to determine, via the generic action start detection network, a confidence score of the first input clip, wherein the confidence score is a dot product between the actionness probability and action start probability, determine a relative action start index corresponding to a frame of the first input clip with the highest confidence score, and transform the relative action start index to an absolute action start index, the absolute action start index corresponding to a location of the frame relative to the video file.

[0034] In some embodiments, the set of instructions may further be executable by the processor to optimize for class imbalance, wherein optimizing for class imbalance further includes determining, via the generic action start detection network, a focal loss of the actionness logic and position logic, and optimizing focal loss by modification of at least one of the feature extraction network, actionness logic, or positioning logic.

[0035] In the following description, for the purposes of explanation, numerous details are set forth to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments may be practiced without some of these details. In other instances, structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.

[0036] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered non-exclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.

[0037] The various embodiments include, without limitation, methods, systems, apparatuses, and/or software products. Merely by way of example, a method might comprise one or more procedures, any or all of which may be executed by a computer system.

Correspondingly, an embodiment might provide a computer system configured with instructions to perform one or more procedures in accordance with methods provided by various other embodiments. Similarly, a computer program might comprise a set of instructions that are executable by a computer system (and/or a processor therein) to perform such operations. In many cases, such software programs are encoded on physical, tangible, and/or non-transitory computer readable media (such as, to name but a few examples, optical media, magnetic media, and/or the like).

[0038] Various embodiments described herein, embodying software products and computer-performed methods, represent tangible, concrete improvements to existing technological areas, including, without limitation, computer vision and video processing. Specifically, implementations of various embodiments provide for a way to detect the start of a significant action or event in a video file using a machine learning (ML) model.

[0039] Conventional approaches to event boundary detection are inefficient and not generalized. Traditional methods are built upon the assumption that there exists a nontrivial, abrupt transition between the background and the action. Thus, traditional methods detect shot transitions within the video and segment the video stream into meaningful and manageable segments. Traditional solutions include similar-scene clustering or moving-object detection to segment whole videos if an abrupt change is detected. Moreover, if moving-object detection is used, a preset velocity threshold needs to first be determined. With known techniques, such as generic event boundary detection (GEBD), entire videos typically must be processed, and further in an offline manner, as GEBD is built upon a "propose and detect" architecture. Other known techniques, such as online detection of action start (ODAS) and StartNet, focus on identifying an action category, detecting the action starts, and trying to minimize the time delay of identifying the start point of an action. These techniques focus on localizing pre-defined action categories, do not scale to generic videos, and introduce additional computational complexity.

[0040] Thus, the generic action start detection (GASD) model, set forth below, allows for a more robust action start solution, which requires fewer priors and less training data, and generalizes better across different data sets.

[0041] To the extent any abstract concepts are present in the various embodiments, those concepts can be implemented as described herein by devices, software, systems, and methods that involve novel functionality (e.g., steps or operations), such as generic action start detection.

[0042] Fig. 1A is a schematic block diagram of a system 100A for generic action start detection. The system 100A includes camera application 105, video data 110, generic action start detection (GASD) logic 155, input clips 115, feature extractor 120, actionness module 125, positioning module 130, actionness probability 135, action start probability 140, loss function 145, and output prediction 150. It should be noted that the various components of the system 100A are schematically illustrated in Fig. 1A, and that modifications to the various components and other arrangements of system 100A may be possible and in accordance with the various embodiments.

[0043] In various embodiments, the camera application 105 may be configured to produce video data 110, which may be provided to the GASD logic 155. The GASD logic 155 may be configured to generate a plurality of input clips 115 from the video data 110. The input clips 115 may be provided to the feature extractor 120. The feature extractor 120 may be coupled to the actionness module 125 and positioning module 130. The actionness module may generate an actionness probability 135. The positioning module 130 may generate an action start probability 140. The actionness module 125 and positioning module 130 may be referred to, interchangeably, as "actionness logic" or an "actionness classifier," and "positioning logic" or "positioning classifier," respectively. The actionness probability 135 and action start probability 140 may be combined to produce an output prediction 150. The actionness probability 135 may further be used to calculate a loss function 145. The loss calculated by the loss function 145 may be used to optimize one or more models of the actionness module 125.
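
For readability, the following sketch mirrors the dataflow of Fig. 1A in code form; the class and method names are hypothetical, and reference numerals from the figure are noted in comments.

class GASDLogic:  # GASD logic 155 (illustrative wiring only)
    def __init__(self, segmenter, feature_extractor, actionness_module, positioning_module):
        self.segmenter = segmenter                    # segmentation logic (not shown in Fig. 1A)
        self.feature_extractor = feature_extractor    # feature extractor 120
        self.actionness = actionness_module           # actionness module 125
        self.positioning = positioning_module         # positioning module 130

    def process(self, video_data):                    # video data 110 from camera application 105
        predictions = []
        for clip in self.segmenter(video_data):       # input clips 115
            feats = self.feature_extractor(clip)      # per-frame feature maps
            actionness_prob = self.actionness(feats)  # actionness probability 135 (also feeds loss function 145)
            start_prob = self.positioning(feats)      # action start probability 140
            predictions.append(actionness_prob * start_prob)  # combined output prediction 150
        return predictions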

[0044] In various embodiments, the camera application 105 may be an application of a user device, configured to capture video data 110 from a camera of the user device. Thus, the camera application 105 may start and stop when video is recorded by the camera. Video data 110 may include video data that has been encoded and/or raw video data received from the camera application 105. In some embodiments, the video data 110 may be real-time and/or live video data captured by the camera application 105 before it is saved and/or stored. For example, video data 110 may include a plurality of video clips and/or video frames obtained from an image sensor, such as a charge-coupled device (CCD) and/or complementary metal oxide semiconductor (CMOS) sensor, in real-time. In other examples, the video data 110 may include a video file and/or one or more video frames that have been stored, temporarily or otherwise, for example in a buffer or other storage device. In various embodiments, the video data 110 may thus be provided, by the camera application 105, to GASD logic 155 for further processing.

[0045] In some embodiments, the GASD logic 155, and/or its subcomponents, including the feature extractor 120, actionness module 125, and positioning module 130, may be implemented as hardware, and/or software running on one or more computer systems of the system 100A. Accordingly, the computer systems may include one or more physical machines or one or more virtual machines (VM) configured to implement the GASD logic 155, feature extractor 120, actionness module 125, and/or positioning module 130. The one or more computer systems may be arranged in a distributed (or centralized) architecture, such as in a cloud platform. In further embodiments, the GASD logic 155 and/or one or more of its subcomponents may be implemented locally on a user device or computer system.

[0046] Accordingly, in some embodiments, GASD logic 155 may be configured to obtain video data 110. In some examples, the GASD logic 155 may obtain video data 110 from the camera application 105. In some examples, GASD logic 155 may be configured to obtain video data 110 from another source or location, such as a storage device or a video data buffer device. The GASD logic 155 may include video segmentation logic (not shown) configured to generate a plurality of input clips 115 from the video data 110.

[0047] As previously described, the GASD model, and specifically the GASD logic 155, provides a genericized solution to detect a taxonomy-free action start in near real-time. To accomplish this, a two-stream network is adopted (also referred to as a GASD model network). For example, a first stream, via the actionness module 125, models the overall actionness of an input clip 115 on a coarse level. A second stream, via the positioning module 130, models the action starts within the clip on a fine level. A final prediction, the output prediction 150, is produced based on a combination of the two streams.
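
As one minimal realization of the two-stream design, assuming per-frame features of dimension D and simple linear heads (neither of which is specified by the disclosure), the coarse and fine streams could be sketched as follows.

import torch
import torch.nn as nn

class TwoStreamGASDHead(nn.Module):
    # Illustrative two-stream head over per-frame features of shape (B, T, D).
    def __init__(self, feat_dim=512):
        super().__init__()
        self.actionness_fc = nn.Linear(feat_dim, 1)    # coarse, clip-level actionness stream
        self.positioning_fc = nn.Linear(feat_dim, 1)   # fine, per-frame positioning stream

    def forward(self, feats):                          # feats: (B, T, D)
        clip_feat = feats.mean(dim=1)                  # pool frame features into a clip-level feature
        actionness_prob = torch.sigmoid(self.actionness_fc(clip_feat))      # (B, 1)
        start_prob = torch.sigmoid(self.positioning_fc(feats)).squeeze(-1)  # (B, T)
        return actionness_prob, start_prob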

[0048] Given that action instances of video data 110 are typically not annotated with time stamps for when actions begin, that different action instances might occur at different speeds and over different durations, and that the difference in appearance and feature space between temporally adjacent video clips is trivial, a multi-scale sampling scheme may be adopted. Thus, in some examples, the video data 110 may be sampled to produce the plurality of input clips 115. In some embodiments, a temporal sliding window at varied lengths, for example, 16, 32, 64, 128, 256, and/or 512 frames, may be taken to produce clips. In some examples, the sliding windows may be taken with a 75% overlap of frames between each window. In some examples, for each window, a uniform sampling of one or more frames may be taken by the GASD logic 155. In some examples, a number T of frames may be taken during each window to produce an input clip, of the plurality of input clips 115, having a length of T frames. During inference, all the sampled windows may be kept the same length.
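
A minimal sketch of the multi-scale sliding-window sampling described above follows; the number of uniformly sampled frames per window (T) is an assumed value.

def multi_scale_windows(num_frames, window_lengths=(16, 32, 64, 128, 256, 512),
                        overlap=0.75, samples_per_clip=16):
    # Enumerate multi-scale sliding windows with 75% overlap between consecutive windows
    # of the same length, and uniformly sample `samples_per_clip` frame indices per window.
    clips = []
    for length in window_lengths:
        if length > num_frames:
            continue
        stride = max(1, int(length * (1 - overlap)))          # 75% overlap -> stride of length / 4
        for start in range(0, num_frames - length + 1, stride):
            step = length / samples_per_clip
            clips.append([start + int(i * step) for i in range(samples_per_clip)])
    return clips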

[0049] In various embodiments, an input clip, having one or more sampled frames, of the plurality of input clips 115 may be fed to the feature extractor 120. The feature extractor 120 may include a machine learning (ML) based feature extraction network. For example, the feature extractor 120 may include, without limitation, a feed-forward network (FFN) such as a multilayer perceptron, a convolutional neural network (CNN), a transformer network, or another suitable neural network architecture. The feature extractor 120 may be configured to generate a feature map for each of the one or more frames of the sampled input clip.

[0050] In various embodiments, the feature extractor 120 may be configured to extract spatial-temporal features through the use of a 3-dimensional (3D) model (e.g., a 3D CNN model). For example, the model may operate on a sequence of 2-dimensional (2D) frames over time. In further embodiments, the architecture of the feature extractor 120 may be configured to extract features through the use of a 2D (spatial) model and a separate 1-dimensional (1D) model to capture temporal elements. The feature extractor may further be configured to provide the extracted spatial-temporal features (e.g., the feature map(s)) to the actionness module 125 and positioning module 130 concurrently.
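
As a rough, non-limiting sketch of 3D spatial-temporal feature extraction of the kind described above, the following PyTorch module stacks a small number of 3D convolutions and pools away the spatial dimensions to yield a per-frame feature map. The layer counts, channel sizes, and pooling choices are assumptions and are not intended to represent any particular backbone (e.g., I3D).

    import torch
    import torch.nn as nn

    class SimpleFeatureExtractor(nn.Module):
        """Toy 3D CNN: input (B, C, T, H, W) -> per-frame features (B, T, D)."""
        def __init__(self, in_channels=3, feat_dim=256):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv3d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv3d(64, feat_dim, kernel_size=3, padding=1), nn.ReLU(),
            )
            # Pool away the spatial dimensions, keep the temporal dimension.
            self.spatial_pool = nn.AdaptiveAvgPool3d((None, 1, 1))

        def forward(self, x):                       # x: (B, C, T, H, W)
            f = self.backbone(x)                    # (B, D, T, H, W)
            f = self.spatial_pool(f)                # (B, D, T, 1, 1)
            return f.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, D)

    # feats = SimpleFeatureExtractor()(torch.randn(1, 3, 16, 112, 112))  # -> (1, 16, 256)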

[0051] In various embodiments, the actionness module 125 and positioning module 130 may include logic implemented as software and/or hardware, and configured to be executed as part of the GASD logic 155, or independently from other subcomponents of the GASD logic 155. Thus, in some examples, the actionness module 125 and/or positioning module 130 may be subroutines of the GASD logic 155, or alternatively, functions that may be invoked remotely by the GASD logic 155.

[0052] In various embodiments, the actionness module 125 may be configured to determine, based on the one or more feature maps of a respective input clip, an actionness probability 135. Thus, the actionness probability 135 may be a probability that a specific input clip, that is, the specific input window of a video (e.g., video data 110), includes action. The positioning module 130 may be configured to encode the spatial-temporal features, based on the one or more feature maps of the respective input clip, to output the specific locations (e.g., frames of the input clip) at which action starts.

[0053] In various embodiments, the actionness module 125 may be a binary classifier. Accordingly, in some examples, the actionness module 125 may be configured to assign a first value to a clip if the actionness module 125 predicts that the clip contains action, and a second value to the clip if it is predicted not to contain action. For example, a clip may be labeled with a value of 1 (action) if the ratio of action frames is larger than an intersection over union (IoU) threshold. Otherwise, the input clip may be assigned a label of 0 (background) if the ratio of action frames is lower than the IoU threshold. Thus, the output of the actionness module 125 may be referred to as the actionness probability, which in the above example, is a probability that a given input clip will be assigned a classification of 1 (e.g., includes action frames).
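
For illustration, assigning an actionness label to an input clip based on an IoU threshold, as described above, might be sketched as follows. The 0.5 threshold and the function name actionness_label are illustrative assumptions only.

    def actionness_label(clip_frames, action_frames, iou_threshold=0.5):
        """Assign 1 (action) if the IoU between the clip's frames and the
        annotated action frames exceeds the threshold, else 0 (background)."""
        clip = set(clip_frames)
        action = set(action_frames)
        overlap = len(clip & action)                  # overlapping frames
        union = len(clip) + len(action) - overlap     # total frames minus overlap
        iou = overlap / union if union else 0.0
        return 1 if iou > iou_threshold else 0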

[0054] Data presented to the actionness module 125 and/or positioning module 130, whether during training or otherwise, may include various instances of different action start classes (also referred to as action classes). Action classes may be associated with, for example, a type of action or event, such as jumping, walking, running, etc. that occurs in the input clip. Thus, in some embodiments, the action classes may include various classes as identified by the feature extractor 120.

[0055] In some examples, focal loss may be utilized to train the actionness module 125 and/or positioning module 130 to handle class imbalance issues (e.g., overrepresentation of one action class over another action class). In some examples, the loss function for the actionness module 125 may be defined as:

L_focal = - Σ_{t=1}^{C} y_t (1 - p_t)^γ log(p_t)    Eq. (1)

Where y_t is the ground truth label, p_t is the probability of the t-th action class, and γ is a focusing parameter. In some examples, C may be set to a value of 2, corresponding to two classes of actions.
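
A minimal PyTorch sketch of a focal loss in the form of Eq. (1) is shown below for illustration; the use of softmax probabilities, the default focusing parameter gamma, and the function name focal_loss are assumptions rather than requirements of the embodiments.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma=2.0, eps=1e-8):
        """Focal loss over C classes (here C = 2: action vs. background).
        logits: (B, C) raw scores; targets: (B,) integer class labels."""
        probs = F.softmax(logits, dim=1)                        # p_t for each class
        p_t = probs.gather(1, targets.unsqueeze(1)).squeeze(1)  # prob of the true class
        # Down-weight easy examples with the (1 - p_t)^gamma modulating factor.
        return (-(1.0 - p_t) ** gamma * torch.log(p_t + eps)).mean()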

[0056] In some alternative embodiments, a contrastive loss may further be utilized to train the actionness module 125. In some embodiments, contrastive loss or a combination of both contrastive loss and focal loss may be utilized to train the actionness module 125, for example, by updating one or more parameters of a model of the GASD logic 155. Models may include actionness models. To determine contrastive loss, for example, the actionness module 125 may be trained in pairs (e.g., pairs of frames) to further distinguish the input actionness category. Specifically, an all-pair (e.g., pairs of all frames in the sampled input clip) learning approach may be utilized. In some examples, the loss function is defined as:

L_contrastive = (1/N) Σ_{i=1}^{N} [ y_i · d_i^2 + (1 - y_i) · max(0, m - d_i)^2 ]    Eq. (2)

where N is the total number of training samples, d_i is a distance between the features f_{i,1} and f_{i,2} of the i-th pair of frames, and m is a margin. In addition, y_i is the similarity label of the i-th pair, where a value of 1 indicates that the input pair is from the same category (e.g., action class), and a value of 0 indicates that the input pair is not from the same category.

[0057] In further embodiments, the positioning module 130 may be a multi-class (T+1) classifier, where T is the number of frames in a video clip (e.g., the input clip). For example, for an input clip with T frames, where T is an integer, there may be a T-number of indices corresponding to the respective frames. The positioning module 130 may set the index of action start, associated with the respective frame, to a value reflecting a probability that an action start exists at the respective frame. Otherwise, the action start index may be set to a value of -1. The loss function for the positioning module 130 may be defined as:

L_position = - log(p'_t)    Eq. (3)

Where p' is the final probability (confidence score), which is defined as the joint probability output by the actionness and positioning modules, as written below:

p' = p_actionness · p_loc    Eq. (4)

Where p_actionness is the probability that the input is predicted to be 1 (action), and p_loc is a vector with a length of T+1 that represents the probability of each index being the action start.

[0058] Thus, the output prediction 150 may, in some examples, be a prediction of an action start (AS) occurring at a certain frame. For example, the output prediction 150 may output an index (e.g., a predicted AS_index) and a confidence score. To determine the predicted AS_index, an AS index within a given clip (AS_relative-index) may be selected as the index with the highest confidence score, and may be expressed as:

AS_relative-index = argmax(p')    Eq. (5)

To determine the absolute AS index (AS_index) from the relative index, the relative AS index may be transformed with the starting frame index of the input clip relative to the video file from which the input clip was sampled, which may be expressed as:

AS_index = AS_relative-index + clip_start_index    Eq. (6)

[0059] Accordingly, in some examples, the output prediction 150 may be an output of both the predicted AS_index and the confidence score p'.
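
Combining Eqs. (4)-(6), a non-limiting NumPy sketch of the final prediction might look as follows; the names predict_action_start, p_actionness, p_loc, and clip_start_index are illustrative.

    import numpy as np

    def predict_action_start(p_actionness, p_loc, clip_start_index):
        """p_actionness: scalar probability that the clip contains action.
        p_loc: length-(T+1) vector; p_loc[t] is the probability that frame t
        is the action start (the last entry meaning no start in this clip)."""
        p_prime = p_actionness * np.asarray(p_loc)         # Eq. (4): joint probability
        as_relative_index = int(np.argmax(p_prime))        # Eq. (5): argmax over indices
        as_index = as_relative_index + clip_start_index    # Eq. (6): absolute frame index
        confidence = float(p_prime[as_relative_index])
        return as_index, confidence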

[0060] In some embodiments, during an evaluation stage of the GASD model, each AS index prediction may be considered correct when its action class from the actionness module is correct, and its offset is smaller than an evaluation threshold. Moreover, unique matching to a ground-truth point may be enforced (e.g., no duplicate matches to a given ground-truth point). In further embodiments, a predicted AS index may be used to trigger recording (e.g., storing of video data) at a first instance that a confidence score is higher than a threshold confidence score, which may be determined by grid search on the training set. For example, in some embodiments, the GASD logic 155 may be configured to trigger the camera application to store video data, starting at the frame indicated by the predicted AS index, when the confidence score is higher than a threshold score.
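
The evaluation rule described above might be approximated by the following sketch, which counts a prediction as correct when its offset from a ground-truth start point is within a threshold and enforces unique matching. It assumes the actionness classification of each prediction has already been verified, and it is a simplified illustration rather than the exact evaluation protocol.

    def match_predictions(predictions, ground_truth_starts, offset_threshold):
        """predictions: list of (as_index, confidence) pairs.
        Returns the number of correct (uniquely matched) predictions."""
        unmatched = set(ground_truth_starts)
        correct = 0
        # Consider the most confident predictions first.
        for as_index, _conf in sorted(predictions, key=lambda p: -p[1]):
            candidates = [g for g in unmatched if abs(g - as_index) <= offset_threshold]
            if candidates:
                g = min(candidates, key=lambda g: abs(g - as_index))
                unmatched.remove(g)       # enforce unique matching to ground truth
                correct += 1
        return correct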

[0061] Accordingly, in some examples, the GASD logic 155 may be configured to provide real-time "online" key event detection (e.g., action start detection). In further examples, the real-time action start detection of GASD logic 155 may be used to trigger automatic video recording, via the camera app 105, as described above. In further examples, the GASD logic 155 may be configured to provide generic event boundary detection, which benefits downstream, "offline" video processing, in which already stored video data may be edited, cropped, trimmed, or otherwise used to identify start action boundaries.

[0062] In the context of "online" or real-time action start detection, certain events may be difficult to capture manually due to the short, transient duration of an event or due to the unpredictability of when an event may begin. Some examples may include, without limitation, water drops falling, a balloon popping, or points being scored during a sporting event. In such cases, automatic video recording may be desired.

[0063] Thus, GASD logic 155 may, in some embodiments, be configured to interface with the camera app 105 to automatically decide a time to begin recording, eliminating delays that could lead to a user missing the desired event altogether. In some embodiments, for example, GASD logic 155 may be configured to control the camera app 105 or transmit a command to the camera app 105 to begin storing video data. Alternatively, GASD logic 155 may, in some examples, be configured to determine and store relevant video data 110 based on the action start prediction.

[0064] In some embodiments, a user of a user device may initiate recording via the camera app 105. For example, the user may push a record button (or other UI element) in the camera app 105 to begin recording. A video monitoring module of the camera app 105 may be invoked, such that video frames are passed to the GASD logic 155. In some examples, the camera app 105 may be configured to directly pass the video frames to the GASD logic 155. In other examples, the camera app 105 may first send the video frames to a buffer, from which the GASD logic 155 may obtain the one or more video frames. In some examples, if no action start is detected in a frame, the frame may be ignored without saving. If an action start is detected, the frame and all subsequent frames may be saved until an end point. In some examples, a fixed recording duration may be used to determine an end point. In further examples, an end point may be reached when the user manually stops the recording. In yet further embodiments, an end point may be automatically determined by a frame, or a series of frames, in which no action is detected. In some embodiments, an end point may be determined when a confidence score falls below a threshold confidence score. In some examples, the actionness module 125 and/or positioning module 130 may be used in combination to determine whether a frame and/or series of one or more frames does not contain action (e.g., are background frames), and automatically stop recording at the camera app 105. Accordingly, GASD logic 155 may automatically determine a point at which an event occurs (e.g., action start) and trigger video recording when an action start is predicted. Thus, by determining when an action start is detected, and which clips contain an event or detected action start, GASD logic 155 may provide more efficient storage of video data, and improve performance of downstream tasks, such as video summarization and video file management.
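
One possible, simplified realization of this recording control is sketched below: frames are saved only between the first instance that the confidence score exceeds a start threshold and the point at which it falls below a stop threshold. The two thresholds, the generator-style interface, and the absence of any buffering or camera-specific calls are all assumptions made for illustration.

    def recording_controller(frame_scores, start_threshold, stop_threshold):
        """frame_scores: iterable of (frame, confidence) in temporal order.
        Yields only the frames between a detected action start and end point."""
        recording = False
        for frame, confidence in frame_scores:
            if not recording and confidence >= start_threshold:
                recording = True             # predicted action start: begin saving
            elif recording and confidence < stop_threshold:
                recording = False            # confidence dropped: end point reached
            if recording:
                yield frame                  # frames outside an action are ignored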

[0065] In further embodiments, GASD logic 155 may further be configured to segment video data 110 into multiple input clips. As previously described, the GASD logic 155 may include segmentation logic to segment a video file into one or more input clips, and video clips may be segmented using a sliding window of a given length of frames. The sliding window, for example, may be 16, 32, 64, 128, 256, and/or 512 frames in length. In some examples, the input clips 115 produced by the sliding window may have a 75% overlap.

[0066] The above embodiments, described with respect to Fig. 1A, produce the output prediction 150 as a joint probability output by the actionness module 125 and positioning module 130. In system 100A, the outputs of the actionness module 125 and positioning module 130 are combined at a late stage. Specifically, the actionness probability 135 is multiplied with the action start probability 140 vector from the positioning module 130. An alternative approach is provided with respect to Fig. 1B below.

[0067] Fig. 1B is a schematic block diagram of an alternative system 100B for generic action start detection. The system 100B includes camera application 105, video data 110, generic action start detection (GASD) logic 155, input clips 115, feature extractor 120, actionness module 125, positioning module 130, actionness probability 135, action start probability 140, output prediction 150, and attention mechanism 160. It should be noted that the various components of the system 100B are schematically illustrated in Fig. 1B, and that modifications to the various components and other arrangements of system 100B may be possible and in accordance with the various embodiments.

[0068] In contrast with the system 100A of Fig. 1A, system 100B comprises an attention mechanism 160 cascaded after the actionness module 125. In various embodiments, the actionness probability 135 may be fed to the attention mechanism 160. The attention mechanism 160 may include, for example, an attention head for applying attention to the actionness probability 135. The output of the attention mechanism 160 may then be combined with the output of the positioning module 130 to produce the final output prediction 150.

[0069] In some examples, the actionness probability 135 may be a scalar value within the range of 0 to 1. The actionness probability 135 may be given by p_actionness. If p_actionness > 0.5, the input clip has a higher probability (e.g., more likely than not) to contain an action or event. If p_actionness < 0.5, and thus 1 - p_actionness > 0.5, the input clip has a higher probability of being a background clip. In this way, the actionness probability 135 may be treated as a vector, p_actionness = [p_actionness, 1 - p_actionness].

[0070] As previously described, the positioning module 130 may be a multi-class (T+1) classifier. When the output from the positioning module 130 is within the range 0 to T, the action start is within the input clip, while if the output from the positioning module 130 is equal to (T+1), then the input clip does not contain an action start. Thus, the output of the positioning module 130 may be considered a vector of size (T+1), where p_loc = [p_loc,1, p_loc,2, ..., p_loc,T+1].

[0071] Thus, in various embodiments, attention mechanism 160 may be configured to apply attention between the actionness probability 135 and the output of the positioning module 130 to generate an action start probability 140. Specifically, in some examples, a product may be taken between the scalar value p_actionness and the first T elements of the vector p_loc, and the scalar value (1 - p_actionness) may be multiplied with the last element (T+1) of the vector p_loc. Accordingly, the action start probability 140, after the attention mechanism, is given by the vector [p_actionness · p_loc,1, ..., p_actionness · p_loc,T, (1 - p_actionness) · p_loc,T+1], which may further be used to give the output prediction 150, as described above.
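
For illustration, the weighting described in this paragraph might be written as follows. The function name combine_with_attention is hypothetical, and the sketch simply applies the element-wise weighting described above rather than a learned attention head.

    import numpy as np

    def combine_with_attention(p_actionness, p_loc):
        """p_actionness: scalar in [0, 1]; p_loc: length-(T+1) vector.
        The first T entries are weighted by p_actionness (action present),
        the last entry by 1 - p_actionness (no action start in the clip)."""
        p_loc = np.asarray(p_loc, dtype=float)
        weights = np.concatenate([np.full(len(p_loc) - 1, p_actionness),
                                  [1.0 - p_actionness]])
        return weights * p_loc   # action start probability vector after attention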

[0072] Fig. 2 is a schematic diagram of input clip labeling framework 200, in accordance with various embodiments. Specifically, input clip labeling framework 200 includes an input clip 205, ground truth background frames 210a, ground truth action frames 215, and ground truth background frames 210b. As previously described, an actionness module may be configured to assign a first value to a clip if the actionness module predicts that the clip contains action, and assign a second value to the clip if it is predicted not to contain action. For example, a clip may be labeled with a value of 1 (action) if the ratio of action frames is larger than an IoU threshold. Thus, in some examples, IoU may refer to the ratio of overlapping frames to the total number of frames (of the input clip 205 and action frames 215 combined) minus the number of overlapping frames.

[0073] For example, the number of overlapping frames may be the number of frames of the input clip 205 that are part of the action frames 215. The action frames 215 may begin at a starting point 220 corresponding to a first time, and end at an end point 225 corresponding to a second time. In some examples, a video file may include sequences of background frames 210a, 210b, in which action does not occur, and action frames 215, in which action occurs. In some embodiments, the action frames 215 may be annotated or otherwise known. However, in many examples, the starting point 220 and end point 225 may not be known. Thus, in some examples, the features of the frames of one or more input clips, including input clip 205, may be extracted by a feature extractor. Temporally adjacent frames may be examined to determine which frames include action, and thus the starting point 220 and end point 225 may be identified.

[0074] In further examples, action frames 215 may be frames which are predicted by the actionness module to include action. Thus, the frames of one or more input clips corresponding to the action frames 215 may first be predicted to contain action by the actionness module. In some examples, the frames of the one or more input clips, including input clip 205, may be added to find the total number of action frames 215. Once the total number of action frames 215 is known, IoU may be calculated for each input clip 205.

[0075] Accordingly, the IoU may be calculated for each input clip 205. If a given input clip is below a threshold IoU score, the actionness module may assign an actionness label of 0. That is, if a given input clip 205 does not have a threshold number of frames that are action frames 215, a score of 0 may be assigned to the input clip. If a given input clip 205 does include a threshold number of frames that are part of the action frames 215, an actionness label of 1 may be assigned. Thus, each input clip 205 may be labeled with an actionness probability, which in some examples, may be a binary classification of 1 or 0.

[0076] Fig. 3 is a schematic diagram illustrating a graph of a confidence score 300 over the frames of a video with corresponding example frames, in accordance with various embodiments. Specifically, actionness probability 310 (p_actionness) is the line graphed on top, and the confidence score 315 (p') is the line graphed on the bottom. As previously described, the actionness probability 310 may be a probability that an input clip is labeled a 1. As seen with respect to the frames 305a-305g, the actionness probability 310 may be a probability that an input clip is assigned an actionness label of 1 (action). Thus, each clip may have a respective actionness probability 310 based on the number of action frames contained in the input clip. A confidence score 315 may be determined based on the actionness probability 310, and specifically as a product of the actionness probability 310 and p_loc, which is a vector of length T+1 that represents the probability that an index (e.g., a respective frame) of a clip is the action start. In various embodiments, a starting point is selected as the frame at which the confidence score 315 exceeds a threshold score.

[0077] Fig. 4 is a flow diagram of a method 400 for generic action start detection, in accordance with various embodiments. The method 400 begins, at block 405, by obtaining video data from a camera application. As previously described, a user device may initiate recording via a camera app, or video data from the camera app may otherwise be monitored, for example, by a video monitoring module. Accordingly, in various embodiments, video data (including one or more video frames) captured by the camera app may be passed to the GASD logic. In some examples, the camera app 105 may be configured to directly pass the video frames to the GASD logic, while in other examples, video frames may be stored in a buffer from which GASD logic may obtain the one or more video frames.

[0078] The method 400 may continue, at block 410, by segmenting the video data into a plurality of input clips. As previously described, the GASD logic may further include segmentation logic to segment a video file into one or more input clips. Video clips may be segmented using a sliding window of a given length of frames. The sliding window, for example, may be 16, 32, 64, 128, 256 and/or 512 frames in length. In some examples, a T-number of frames may be taken during each window to produce an input clip, of the plurality of input clips, having a length of T-frames.

[0079] At block 415, the method 400 continues by extracting features from each input clip of the plurality of input clips. As previously described, the input clips may be fed to or otherwise provided to a feature extractor. The feature extractor may include a feature extraction network. In some examples, the feature extractor may utilize a 3D convolutional network model. For example, the feature extractor may utilize a two-stream inflated 3D ConvNet (I3D) model. In further examples, the feature extractor 120 may include, without limitation, an FFN (e.g., an MLP), another CNN, a recurrent neural network (RNN) or long short-term memory (LSTM) network, or a transformer-based network architecture. Thus, in various embodiments, the feature extractor may be configured to generate a feature map for each of the one or more frames of the sampled input clip. The feature map may, in some examples, include low-level spatial-temporal features.

[0080] At block 420, the method 400 continues by determining an actionness probability. In various embodiments, an actionness module may be configured to obtain one or more respective feature maps of the one or more input clips. Based on the feature maps, the actionness module may be configured to determine an actionness probability. As previously described, determining an actionness probability may include determining a binary classification of an input clip as including action or not including action (e.g., a background clip). The frames of an input clip predicted to be action frames may be added to find the total number of action frames in the input clip. In some examples, the actionness module may be configured to assign a first value (e.g., 1) to an input clip if the actionness module predicts that the clip contains action, and a second value (e.g., 0) to the clip if it is predicted not to contain action. In some examples, the input clip may be labeled with a value of 1 if the ratio of action frames is larger than an intersection over union (IoU) threshold. Accordingly, in some examples, the actionness module may first determine a total number of action frames out of the one or more input clips, and determine an IoU of a respective input clip as a ratio of the number of action frames in a respective input clip and the total number of action frames.

[0081] The method 400 continues, at block 425, by determining an action start probability. As previously described, the positioning module 130 may be a multi-class (T+1) classifier, where T is the number of frames in a video clip (e.g., the input clip). The positioning module may thus set the index of action start, associated with the respective frame, to a value reflecting a probability that an action start exists at the respective frame.

[0082] The method 400 includes, at block 430, determining a confidence score and predicted action start index. As previously described, the confidence score may be the product of the actionness probability (e.g., p_actionness) and the action start probability (e.g., p_loc) described above. Thus, the confidence score may indicate a confidence that the input clip includes action, together with an action start prediction of the frame of the input clip at which the action starts. In some embodiments, the predicted action start index may be determined based on a relative action start index within a given clip, which may be translated to an action start index for the entire video file (e.g., an absolute AS index) via a transform with a clip start index. In various embodiments, the AS starting point (e.g., AS index) may be selected as the frame at which the confidence score exceeds a threshold score, as described with respect to Fig. 3.

[0083] Based on the above determination, the method 400 may further include, at block 435, triggering a video recording based on an output prediction of both the confidence score and AS index. As previously described, GASD logic may be configured to control the camera app or transmit a command to the camera app to begin recording and/or storing video data.

Alternatively, GASD logic may, in some examples, be configured to identify and store relevant portions of previously stored (e.g., offline) video data based on the action start prediction. Accordingly, the GASD logic may be configured to automatically begin recording while eliminating delays which could lead to a user missing the desired event altogether. In further embodiments, a predicted AS index may be used to trigger recording (e.g., storing of video data) at a first instance that a confidence score is determined to be higher than a threshold confidence score.

[0084] In some examples, if no action start is detected in a frame, the frame may be ignored without saving. Thus, by determining when an action start is detected, and which clips contain an event or detected action start, GASD logic may provide more efficient storage of video data, and improve performance of downstream tasks, such as video summarization and video file management.

[0085] In further embodiments, the method 400 may further include, at block 440, determining an end point of an action. If an action start is detected, the frame and all subsequent frames may be saved until an end point. In some examples, a fixed recording duration may be used to determine an end point. In further examples, an end point may be reached when the user manually stops the recording. In yet further embodiments, an end point may be automatically determined by a frame, or series of frames, in which no action is detected. In some examples, the actionness module and/or positioning module may be used to determine whether a frame and/or series of one or more frames does not contain action (e.g., are background frames), and automatically stop recording at the camera app. Accordingly, GASD logic may automatically determine a point at which an event occurs (e.g., action start) and trigger video recording when an action start is predicted.

[0086] The techniques and processes described above with respect to various embodiments may be performed by one or more computer systems. Fig. 5 is a schematic block diagram of a computer system 500 for generic action start detection, in accordance with various embodiments. Fig. 5 provides a schematic illustration of one embodiment of a computer system 500, such as the systems 100A, 100B, or subsystems thereof, which may perform the methods provided by various other embodiments, as described herein. It should be noted that Fig. 5 only provides a generalized illustration of various components, of which one or more of each may be utilized as appropriate. Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.

[0087] The computer system 500 includes multiple hardware elements that may be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and microcontrollers); one or more input devices 515, which include, without limitation, a mouse, a keyboard, one or more sensors, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, and/or the like.

[0088] The computer system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash-updateable, and/or the like.

Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.

[0089] The computer system 500 might also include a communications subsystem 530, which may include, without limitation, a modem, a network card (wireless or wired), an IR communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, a Z-Wave device, a ZigBee device, cellular communication facilities, etc.), and/or a low-power wireless device. The communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, between data centers or different cloud platforms, and/or with any other devices described herein. In many embodiments, the computer system 500 further comprises a working memory 535, which can include a RAM or ROM device, as described above.

[0090] The computer system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.

[0091] A set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.

[0092] It will be apparent to those skilled in the art that substantial variations may be made in accordance with specific requirements. For example, customized hardware (such as programmable logic controllers, single board computers, FPGAs, ASICs, and SoCs) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.

[0093] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer system 500) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.

[0094] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in a specific fashion. In an embodiment implemented using the computer system 500, various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525. Volatile media includes, without limitation, dynamic memory, such as the working memory 535. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including, without limitation, radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications).

[0095] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.

[0096] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer system 500. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.

[0097] The communications subsystem 530 (and/or components thereof) generally receives the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions. The instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.

[0098] While some features and aspects have been described with respect to the embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while some functionality is ascribed to one or more system components, unless the context dictates otherwise, this functionality can be distributed among various other system components in accordance with the several embodiments.

[0099] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with or without some features for ease of description and to illustrate aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.