

Title:
RECOGNITION OF INTERESTING EVENTS IN IMMERSIVE VIDEO
Document Type and Number:
WIPO Patent Application WO/2017/087641
Kind Code:
A1
Abstract:
Systems and methods for video editing and playback are described. A video including a plurality of signals is received, where at least one signal represents an identifiable type of content over a length of the video. One or more portions of interest of the video are identified, each portion of interest being identified based on at least one of the signals, and each portion of interest being associated with a time of interest and at least one spatial focus point. During playback of the video by a user, upon reaching a time of interest associated with a particular portion of interest of the video, the field of view of the user is directed to the spatial focus point associated with the particular portion of interest.

Inventors:
WHITE SEAN MICHAEL (US)
MCCARTHY IAN (US)
Application Number:
PCT/US2016/062482
Publication Date:
May 26, 2017
Filing Date:
November 17, 2016
Assignee:
BRIGHTSKY LABS INC (US)
International Classes:
G11B27/28; G11B27/031; G11B27/32; G11B27/34
Domestic Patent References:
WO2010117213A2 (2010-10-14)
Foreign References:
EP2816564A1 (2014-12-24)
US20130081082A1 (2013-03-28)
US20070120986A1 (2007-05-31)
EP1708101A1 (2006-10-04)
US20150110413A1 (2015-04-23)
Other References:
None
Attorney, Agent or Firm:
ARGENRIERI, Steven, R. et al. (US)
Claims:
What is claimed is:

1. A computer-implemented method comprising: receiving a video comprising a plurality of signals, at least one signal representing an identifiable type of content over a length of the video;

identifying one or more portions of interest of the video, each portion of interest being identified based on at least one of the signals, each portion of interest being associated with a time of interest and at least one spatial focus point; and

during playback of the video by a user, upon reaching a time of interest associated with a particular portion of interest of the video, directing a field of view of the user to the spatial focus point associated with the particular portion of interest.

2. The method of claim 1, wherein the video is represented in three-dimensional space.

3. The method of claim 1, wherein each keyframe of the video is associated with a preferred spatial focus point.

4. The method of claim 1, wherein each frame of the video is associated with a preferred spatial focus point.

5. The method of claim 1, wherein the time of interest comprises a range of time.

6. The method of claim 1, wherein a particular portion of interest is associated with a plurality of spatial focus points.

7. The method of claim 6, further comprising ranking the spatial focus points based on respective probabilities that each spatial focus point is interesting.

8. The method of claim 1, wherein directing the field of view comprises:

normalizing a plurality of spatial focus points over a period of time into a single spatial focus point; and

directing the field of view of the user to the single spatial focus point during the period of time.

9. The method of claim 1, wherein the identifiable type of content for a particular signal is selected from the group consisting of motion, sound, presence of faces, recognized faces, recognized objects, recognized activities, and recognized scenes.

10. The method of claim 1, wherein identifying a particular portion of interest of the video comprises:

identifying at least one intermediate portion of interest in the video based on one or more of the signals;

associating a weighting with each of the one or more signals, wherein a particular weighting is determined based at least in part on historical attributes associated with at least one of an individual and a group of users; and

identifying the particular portion of interest of the video based on the at least one intermediate portion of interest and the one or more signal weightings.

11. The method of claim 10, wherein identifying a particular portion of interest of the video comprises:

combining the signals according to the respective weighting of each signal;

identifying a portion of the combined signals that meets a threshold signal intensity; and

identifying as the particular portion of interest a portion of the video that corresponds to the identified portion of combined signals.

12. The method of claim 10, wherein identifying a particular portion of interest of the video comprises:

combining the signals according to the respective weighting of each signal;

identifying a portion of the combined signals that comprises a high or low signal intensity relative to other portions of the combined signals; and

identifying as the particular portion of interest a portion of the video that corresponds to the identified portion of combined signals.

13. The method of claim 10, wherein a particular historical attribute associated with the individual comprises: a propensity of the individual to favor video content having a particular signal, a propensity of the individual to favor video content lacking a particular signal, a propensity of the individual to favor video content having a particular signal with a particular signal intensity, a propensity of the individual to disfavor video content having a particular signal, a propensity of the individual to disfavor video content lacking a particular signal, or a propensity of the individual to disfavor video content having a particular signal with a particular signal intensity.

14. The method of claim 10, wherein a particular historical attribute associated with the group of users comprises: a propensity of the group of users to favor video content having a particular signal, a propensity of the group of users to favor video content lacking a particular signal, a propensity of the group of users to favor video content having a particular signal with a particular signal intensity, a propensity of the group of users to disfavor video content having a particular signal, a propensity of the group of users to disfavor video content lacking a particular signal, or a propensity of the group of users to disfavor video content having a particular signal with a particular signal intensity.

15. The method of claim 1, wherein at least one of the signals comprises sensor readings over a length of the video.

16. The method of claim 15, wherein the sensor comprises an accelerometer, a gyroscope, a heart rate sensor, a compass, a light sensor, a GPS, or a motion sensor.

17. A system comprising: at least one memory for storing computer-executable instructions; and

at least one processor for executing the instructions stored on the at least one memory, wherein execution of the instructions programs the at least one processor to perform operations comprising:

receiving a video comprising a plurality of signals, at least one signal representing an identifiable type of content over a length of the video;

identifying one or more portions of interest of the video, each portion of interest being identified based on at least one of the signals, each portion of interest being associated with a time of interest and at least one spatial focus point; and

during playback of the video by a user, upon reaching a time of interest associated with a particular portion of interest of the video, directing a field of view of the user to the spatial focus point associated with the particular portion of interest.

18. The system of claim 17, wherein the video is represented in three-dimensional space.

19. The system of claim 17, wherein each keyframe of the video is associated with a preferred spatial focus point.

20. The system of claim 17, wherein each frame of the video is associated with a preferred spatial focus point.

21. The system of claim 17, wherein the time of interest comprises a range of time.

22. The system of claim 17, wherein a particular portion of interest is associated with a plurality of spatial focus points.

23. The system of claim 22, wherein the operations further comprise ranking the spatial focus points based on respective probabilities that each spatial focus point is interesting.

24. The system of claim 17, wherein directing the field of view comprises:

normalizing a plurality of spatial focus points over a period of time into a single spatial focus point; and

directing the field of view of the user to the single spatial focus point during the period of time.

25. The system of claim 17, wherein the identifiable type of content for a particular signal is selected from the group consisting of motion, sound, presence of faces, recognized faces, recognized objects, recognized activities, and recognized scenes.

26. The system of claim 17, wherein identifying a particular portion of interest of the video comprises:

identifying at least one intermediate portion of interest in the video based on one or more of the signals;

associating a weighting with each of the one or more signals, wherein a particular weighting is determined based at least in part on historical attributes associated with at least one of an individual and a group of users; and

identifying the particular portion of interest of the video based on the at least one intermediate portion of interest and the one or more signal weightings.

27. The system of claim 26, wherein identifying a particular portion of interest of the video comprises:

combining the signals according to the respective weighting of each signal;

identifying a portion of the combined signals that meets a threshold signal intensity; and

identifying as the particular portion of interest a portion of the video that corresponds to the identified portion of combined signals.

28. The system of claim 26, wherein identifying a particular portion of interest of the video comprises:

combining the signals according to the respective weighting of each signal;

identifying a portion of the combined signals that comprises a high or low signal intensity relative to other portions of the combined signals; and

identifying as the particular portion of interest a portion of the video that corresponds to the identified portion of combined signals.

29. The system of claim 26, wherein a particular historical attribute associated with the individual comprises: a propensity of the individual to favor video content having a particular signal, a propensity of the individual to favor video content lacking a particular signal, a propensity of the individual to favor video content having a particular signal with a particular signal intensity, a propensity of the individual to disfavor video content having a particular signal, a propensity of the individual to disfavor video content lacking a particular signal, or a propensity of the individual to disfavor video content having a particular signal with a particular signal intensity.

30. The system of claim 26, wherein a particular historical attribute associated with the group of users comprises: a propensity of the group of users to favor video content having a particular signal, a propensity of the group of users to favor video content lacking a particular signal, a propensity of the group of users to favor video content having a particular signal with a particular signal intensity, a propensity of the group of users to disfavor video content having a particular signal, a propensity of the group of users to disfavor video content lacking a particular signal, or a propensity of the group of users to disfavor video content having a particular signal with a particular signal intensity.

31. The system of claim 17, wherein at least one of the signals comprises sensor readings over a length of the video.

32. The system of claim 31, wherein the sensor comprises an accelerometer, a gyroscope, a heart rate sensor, a compass, a light sensor, a GPS, or a motion sensor.

Description:
RECOGNITION OF INTERESTING EVENTS IN IMMERSIVE VIDEO

Cross-Reference to Related Application

[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 62/256,443, filed on November 17, 2015, and entitled "Recognition of Interesting Events in Immersive Video," the entirety of which is incorporated by reference herein.

Background

[0002] The present disclosure relates generally to media curation, editing and playback and, more particularly, to systems and methods for identifying portions of interest in 360-degree video and other forms of immersive video.

[0003] Creators of media content often generate substantially more content than is needed or used in a final audio and/or video production. Content creators may only be interested in showcasing the most interesting or relevant portions of their generated content to an audience. For example, a snowboarder may desire to exhibit his best tricks on video, while discarding intermediate portions of the video that show him boarding down a mountainside between jumps. While the snowboarder can upload his video and utilize complex video editing software to compile a highlights video once he has returned from the slopes, identifying interesting video segments, editing captured video, and sharing modified content in the midst of his excursion is exceedingly difficult. There is a need for systems and methods that facilitate the foregoing tasks for content creators.

Brief Summary

[0004] Systems and methods for video editing and playback are disclosed herein. In one aspect, a computer-implemented method comprises: receiving a video comprising a plurality of signals, at least one signal representing an identifiable type of content over a length of the video; identifying one or more portions of interest of the video, each portion of interest being identified based on at least one of the signals, each portion of interest being associated with a time of interest and at least one spatial focus point; and during playback of the video by a user, upon reaching a time of interest associated with a particular portion of interest of the video, directing a field of view of the user to the spatial focus point associated with the particular portion of interest.

[0005] Various implementations of this aspect include one or more of the following features. The video is represented in three-dimensional space. Each keyframe of the video is associated with a preferred spatial focus point. Each frame of the video is associated with a preferred spatial focus point. The time of interest comprises a range of time. A particular portion of interest is associated with a plurality of spatial focus points. The identifiable type of content for a particular signal is selected from the group consisting of motion, sound, presence of faces, recognized faces, recognized objects, recognized activities, and recognized scenes. At least one of the signals comprises sensor readings over a length of the video. The sensor comprises an accelerometer, a gyroscope, a heart rate sensor, a compass, a light sensor, a GPS, or a motion sensor.

[0006] In one implementation, the method further comprises ranking the spatial focus points based on respective probabilities that each spatial focus point is interesting. Directing the field of view can include: normalizing a plurality of spatial focus points over a period of time into a single spatial focus point; and directing the field of view of the user to the single spatial focus point during the period of time. Identifying a particular portion of interest of the video can include: identifying at least one intermediate portion of interest in the video based on one or more of the signals; associating a weighting with each of the one or more signals, wherein a particular weighting is determined based at least in part on historical attributes associated with at least one of an individual and a group of users; and identifying the particular portion of interest of the video based on the at least one intermediate portion of interest and the one or more signal weightings. Identifying a particular portion of interest of the video can include: combining the signals according to the respective weighting of each signal; identifying a portion of the combined signals that meets a threshold signal intensity; and identifying as the particular portion of interest a portion of the video that corresponds to the identified portion of combined signals. Identifying a particular portion of interest of the video can include: combining the signals according to the respective weighting of each signal; identifying a portion of the combined signals that comprises a high or low signal intensity relative to other portions of the combined signals; and identifying as the particular portion of interest a portion of the video that corresponds to the identified portion of combined signals.

[0007] A particular historical attribute associated with the individual can include: a propensity of the individual to favor video content having a particular signal, a propensity of the individual to favor video content lacking a particular signal, a propensity of the individual to favor video content having a particular signal with a particular signal intensity, a propensity of the individual to disfavor video content having a particular signal, a propensity of the individual to disfavor video content lacking a particular signal, or a propensity of the individual to disfavor video content having a particular signal with a particular signal intensity. A particular historical attribute associated with the group of users can include: a propensity of the group of users to favor video content having a particular signal, a propensity of the group of users to favor video content lacking a particular signal, a propensity of the group of users to favor video content having a particular signal with a particular signal intensity, a propensity of the group of users to disfavor video content having a particular signal, a propensity of the group of users to disfavor video content lacking a particular signal, or a propensity of the group of users to disfavor video content having a particular signal with a particular signal intensity.

[0008] Other aspects of the present invention include corresponding systems and computer readable media. The details of one or more implementations of the subject matter described in the present specification are set forth in the accompanying drawings and the description below. Other features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

Brief Description of the Drawings

[0009] In the drawings, like reference characters generally refer to the same parts throughout the different views. Also, the drawings are not necessarily to scale, emphasis instead generally being placed upon illustrating the principles of the implementations. In the following description, various implementations are described with reference to the following drawings, in which:

[0010] FIG. 1 depicts an example system architecture for a video editing and playback system according to an implementation.

[0011] FIG. 2 depicts a flowchart of an example method for directing a user's field of view to an area of interest in a video.

Detailed Description

[0012] Described herein in various implementations are systems and methods for editing, manipulating, and viewing media content. Media content can include digital media encoded in a machine-readable format, including but not limited to audio (e.g., sound recordings of events, activities, performances, speech, music, etc.), video (visual recordings of events, activities, performances, animation, etc.), and other forms of media content usable in conjunction with the techniques described herein. Media content can also include streaming media (recorded or live). Video content can be presented in a two-dimensional or three-dimensional manner. For example, immersive video can be presented to a user via a virtual reality headset or similar device to provide a 360-degree (or less) experience.

[0013] FIG. 1 depicts an example high-level system architecture in which an application 115 on a user device 110 communicates with one or more remote servers 120 over communications network 150. The user device 110 can be, for example, a smart phone, tablet computer, smart watch, smart glasses, virtual reality headset, portable computer, mobile telephone, laptop, palmtop, gaming device, music device, television, smart or dumb terminal, network computer, personal digital assistant, wireless device, information appliance, workstation, minicomputer, mainframe computer, or other computing device that is operated as a general purpose computer or as a special purpose hardware device that can execute the functionality described herein.

[0014] The application 115 on the user device 110 can provide media playback and editing functionality to a device user. In one implementation, the application 115 provides a user interface that allows a user to browse through, manipulate, edit, and/or play media content (e.g., a video file, an audio file, etc.), and can include a visual representation of a timeline to control these actions. In another implementation, the application 115 analyzes media content to identify one or more portions of interest, which analysis can be based on a weighting of various signals associated with the content. As used herein, a "signal" refers to time-varying data describing an identifiable type of content in audio, video, or other media content or a portion thereof, including, but not limited to, motion data (e.g., displacement, direction, velocity, acceleration, orientation, angular momentum, and time), sound, geographic location, presence of faces, recognized faces, recognized objects, recognized activities, recognized scenes, and social graph information (presence of recognized social contacts, friends, connections, and the like, of varying degrees). A signal can also refer to a time-varying or static attribute associated with media content or a portion thereof, including, but not limited to, popularity (e.g., measurement of likes, recommendations, sharing), context, sensor readings on a device (e.g., readings from an accelerometer, gyroscope, heart rate sensor, compass, light sensor, motion sensor, and the like), user label (e.g., a comment, hashtag, or other label that can provide hints as to the content of a media file), location, date, time, weather, and user-specified (e.g., manually-defined as interesting). Signal weighting data can be stored locally on the user device 110 and/or can be transferred to and received from remote server 120.
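
By way of illustration only, the following Python sketch shows one possible in-memory representation of such a time-varying signal; the class and field names (Signal, samples, sample_rate_hz) are hypothetical and are not prescribed by this disclosure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signal:
    """A time-varying signal extracted from, or attached to, a media file.

    All names here are illustrative; no particular schema is required.
    """
    kind: str             # e.g. "motion", "sound", "faces", "heart_rate"
    samples: List[float]  # one intensity value per sampled instant
    sample_rate_hz: float # how often the signal was sampled
    explicit: bool = False  # True for user-tagged ("explicit") signals

    def intensity_at(self, t_seconds: float) -> float:
        """Return the signal intensity at a given time, clamped to the ends."""
        idx = int(t_seconds * self.sample_rate_hz)
        idx = max(0, min(idx, len(self.samples) - 1))
        return self.samples[idx]
```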

[0015] Remote server(s) 120 can aggregate signal weighting data, social experience information, and other media analytics received from user device 110 and other user devices 180 and share the data among the devices over communications network 150. In some implementations, remote server(s) 120 host and/or proxy media, webpages, and/or other content accessible by the user device 110 via application 115. Remote server(s) 120 can also perform portions of the various processes described herein; for example, analysis of media content to identify signals can be performed in whole or in part remotely, rather than locally on the user device 110.

[0016] Third-party services 170 can include social networking, media sharing, content distribution, and/or other platforms through which a user can send, receive, share, annotate, edit, track, or take other actions with respect to media content using, e.g., application 115 via communications network 150. Third-party services 170 can include, but are not limited to, YouTube, Facebook, WhatsApp, Vine, Snapchat, Instagram, Twitter, Flickr, and Reddit.

[0017] Implementations of the present system can use appropriate hardware or software; for example, the application 115 and other software on user device 110 and/or remote server(s) 120 can execute on a system capable of running an operating system such as the Microsoft Windows® operating systems, the Apple OS X® operating systems, the Apple iOS® platform, the Google Android™ platform, the Linux® operating system and other variants of UNIX® operating systems, and the like. The software can be implemented on a general purpose computing device in the form of a computer including a processing unit, a system memory, and a system bus that couples various system components including the system memory to the processing unit.

[0018] Additionally or alternatively, some or all of the functionality described herein can be performed remotely, in the cloud, or via software-as-a-service. For example, as described above, certain functions, such as those provided by the remote server 120, can be performed on one or more servers or other devices that communicate with user devices 110, 180. The remote functionality can execute on server class computers that have sufficient memory, data storage, and processing power and that run a server class operating system (e.g., Oracle® Solaris®, GNU/Linux®, and the Microsoft® Windows® family of operating systems).

[0019] The system can include a plurality of software processing modules stored in a memory and executed on a processor. By way of illustration, the program modules can be in the form of one or more suitable programming languages, which are converted to machine language or object code to allow the processor or processors to execute the instructions. The software can be in the form of a standalone application, implemented in a suitable programming language or framework.

[0020] Method steps of the techniques described herein can be performed by one or more programmable processors executing one or more computer programs to perform functions by operating on input data and generating output. Method steps can also be performed by, and apparatus can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit). Modules can refer to portions of the computer program and/or the processor/special circuitry that implements that functionality.

[0021] Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors. Generally, a processor receives instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. One or more memories can store media assets (e.g., audio, video, graphics, interface elements, and/or other media files), configuration files, and/or instructions that, when executed by a processor, form the modules, engines, and other components described herein and perform the functionality associated with the components. The processor and the memory can be supplemented by, or incorporated in special purpose logic circuitry.

[0022] In some implementations, the user device 110 includes a web browser, native application, or both, that facilitates execution of the functionality described herein. A web browser allows the device to request a web page or other program, applet, document, or resource (e.g., from a remote server 120 or other server, such as a web server) with an HTTP request. One example of a web page is a data file that includes computer executable or interpretable information, graphics, sound, text, and/or video, that can be displayed, executed, played, processed, streamed, and/or stored and that can contain links, or pointers, to other web pages. In one implementation, a user of the user device 110 manually requests a resource from a server. Alternatively, the device 110 automatically makes requests with a browser application. Examples of commercially available web browser software include Microsoft® Internet Explorer®, Mozilla® Firefox®, and Apple® Safari®.

[0023] In other implementations, the user device 110 includes client software, such as application 115. The client software provides the device 110 with the functionality for implementing and executing the features described herein. The client software can be implemented in various forms; for example, it can be in the form of a native application, web page, widget, and/or Java, JavaScript, .Net, Silverlight, Flash, and/or other applet or plug-in that is downloaded to the device and runs in conjunction with a web browser. The client software and the web browser can be part of a single client-server interface; for example, the client software can be implemented as a plug-in to the web browser or to another framework or operating system. Other suitable client software architectures, including but not limited to widget frameworks and applet technology, can also be employed with the client software.

[0024] A communications network 150 can connect user devices 110, 180 with one or more servers or devices, such as remote server 120. The communication can take place over media such as standard telephone lines, LAN or WAN links (e.g., T1, T3, 56kb, X.25), broadband connections (ISDN, Frame Relay, ATM), wireless links (802.11 (Wi-Fi), Bluetooth, GSM, CDMA, etc.), for example. Other communication media are contemplated. The network 150 can carry TCP/IP protocol communications, and HTTP/HTTPS requests made by a web browser, and the connection between the client device and servers can be communicated over such TCP/IP networks. Other communication protocols are contemplated.

[0025] The system can also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules can be located in both local and remote computer storage media including memory storage devices. Other types of system hardware and software than that described herein can also be used, depending on the capacity of the device and the amount of required data processing capability. The system can also be implemented on one or more virtual machines executing virtualized operating systems such as those mentioned above, and that operate on one or more computers having hardware such as that described herein.

[0026] It should also be noted that implementations of the systems and methods can be provided as one or more computer-readable programs embodied on or in one or more articles of manufacture. The program instructions can be encoded on an artificially-generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. A computer storage medium can be, or be included in, a computer-readable storage device, a computer-readable storage substrate, a random or serial access memory array or device, or a combination of one or more of them. Moreover, while a computer storage medium is not a propagated signal, a computer storage medium can be a source or destination of computer program instructions encoded in an artificially-generated propagated signal. The computer storage medium can also be, or be included in, one or more separate physical components or media (e.g., multiple CDs, disks, or other storage devices).

[0027] In one implementation, the application 115 on the user device 110 automatically identifies one or more portions of media content that may be of interest to the device user. The automatic identification of portions of interest can also be performed remotely, wholly or in part, by, e.g., remote server 120. A portion of interest of media content can be defined temporally (e.g., the portion of interest occurs between times T1 and T2 of a video). Further, with certain types of media content, such as video and, in particular, video represented in three-dimensional space (e.g., 360-degree video), the portion of interest can also be defined spatially, and in combination with a temporal attribute (e.g., the portion of interest occurs between times T1 and T2 at rotations yaw, pitch, roll of a 360-degree video).
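
As an illustrative sketch of how such a temporally and spatially defined portion of interest might be represented in code (the names FocusPoint and PortionOfInterest are hypothetical, not part of the disclosure):

```python
from dataclasses import dataclass


@dataclass
class FocusPoint:
    """Spatial focus point for an immersive frame, in degrees."""
    yaw: float    # rotation about the vertical axis
    pitch: float  # up/down
    roll: float   # tilt


@dataclass
class PortionOfInterest:
    """A portion of interest defined by a time range and a spatial focus point."""
    t_start: float      # seconds into the video (T1)
    t_end: float        # seconds into the video (T2)
    focus: FocusPoint   # where the viewer's field of view should be directed
```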

[0028] Portions of interest in media content can be automatically identified based on one or more signals associated with the content. As described above, a signal can represent an identifiable type of content within digital media (e.g., motion, sound, recognized faces (known or unknown people), recognized objects, recognized activities, recognized scenes, social graph information, and the like), as well as an attribute associated with the media (e.g., popularity, context, location, date, time, weather, news reports, and so on).

[0029] A signal can vary in intensity over the temporal length of an audio file, video file, live media, or other media content. With certain types of media content, such as video and, in particular, video represented in actual or simulated three-dimensional space (e.g., 360-degree video), a signal can also vary in intensity over spatial areas of the content. For example, in a 360-degree video, a spatial area S1 directly in front of a viewer may be interesting at a first point in time T1, while a spatial area S2 behind the viewer (outside of the viewer's field of view) may not be of interest at the same time T1. At a later time T2, the action may move so that S2 is interesting and S1 is not.

[0030] "Signal intensity," as used herein, refers to the presence of a particular content type in media content and, in some cases, the extent to which the content type exists in a particular portion of the media content. In the case of explicit signals and certain attributes associated with media content, signal intensity can be binary (e.g., exists or does not exist). For content types such as motion, sound, facial recognition, and so on, as well as certain sensor readings the intensity can be a function of the concentration of the content type in a particular portion of the media content, and can, for example, vary over a fixed range or dynamic range (e.g., defined relative to the intensities over the signal domain and/or relative to other signals), or fall into defined levels or tiers (e.g., zero intensity, low intensity, medium intensity, high intensity). In the case of motion content, portions of media content that are determined to have higher instances of movement (or a particular type of movement indicative of a particular activity such as, for example, skiing or bicycle riding) will have correspondingly higher motion intensity levels. As another example, the intensity of audio content can be determined based on the loudness of audio in a particular portion of media content. For general facial recognition, intensity can be based on the number of identified faces in a particular portion of a video (e.g., more faces equals higher intensity). For known facial recognition, intensity can be based on the number of identified faces that are known to a user in a particular portion of a video (e.g., friends, family, social networking connections, etc.). For social graph information, intensity can be similar to that for facial recognition, and can also be based on the degree of a relationship (e.g., recognition of a direct friend would have a higher signal intensity than recognition of a friend-of-a-friend). In the case of external sensor readings associated with media content (e.g., an accelerometer in a smartphone), intensity can be based on the amount strength of the readings detected by the sensor (e.g., for the accelerometer, stronger movement readings equals higher intensity).

[0031] Certain signals are considered "implicit," as they can be automatically identified based on the media content or an associated attribute. Implicit signals can include motion, sound, facial/object recognition, popularity, context, and so on. Other signals are "explicit," in that they can include manually defined elements. For example, a user can manually tag a portion of a video prior to, during, or after recording to indicate that the portion should be considered interesting. In some implementations, while recording audio and/or video, the user manipulates a control (e.g., a button) on a recording device, on the user device 110, or on another external device (e.g., a wirelessly connected ring, wristwatch, pendant, or other wearable device) in communication (e.g., via Bluetooth, Wi-Fi, etc.) with the recording and/or user device 110, to indicate that an interesting portion of the audio/video is beginning. The user can then manipulate the same or a different control a second time to indicate that the interesting portion has ended. The period between the start and end time of the interesting portion can then be considered as having a "user-tagged" signal.

[0032] Another type of signal, referred to herein as a "social signal," indicates the level of social interest over the length and/or spatial areas of media content. A social signal can be based on the presence of likes, comments, recommendations, shares, and/or other indications of social interest and can have an associated intensity that varies in relation to popularity. For example, portions of media content that have a higher concentration of likes, comments, and other indicators of interest relative to other portions of the media content will have a higher social signal intensity. In some implementations, the social signal can be used in further refining signal weightings (described below) for the corresponding media content or other media content. In one example, if a video is published in which the motion signal was heavily weighted compared to other signals, but users, upon viewing the video, prefer portions of the video in which many faces appear (i.e., the social signal has a higher intensity at these face portions), then the social signal may cause future weightings of the media content or other media content to be biased more toward the facial recognition signal. More specifically, social signals can become part of the training data that influences the determination of signal weights.

[0033] FIG. 2 depicts one implementation of a method 200 for identifying a portion of interest of media content (in this example, a video). In STEP 202, a video is received (e.g., downloaded, copied, streamed, or otherwise provided to user device 110, remote server 120, or other processing device). The video can include one or more signals, such as those signals described above. For at least one of the signals, an intermediate portion of interest in the video is identified based on the respective signal (STEP 206). A particular intermediate portion of interest of the video can be determined based on the intensity of a signal associated with that portion. For example, if a certain portion of the video has an incidence of loud noise relative to other portions of the video, that certain portion can be considered an intermediate portion of interest based on the intensity of the audio signal. In some implementations, intermediate portions of interest can be identified based on the intensity of multiple signals within the respective portions.
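
A minimal sketch of STEP 206 under simplifying assumptions (a single signal sampled at a fixed rate, with "loud relative to other portions" approximated as exceeding a multiple of the signal's mean) might look like the following; the function name and the 1.5x factor are illustrative only.

```python
from typing import List, Sequence, Tuple


def intermediate_portions(samples: Sequence[float],
                          sample_rate_hz: float,
                          rel_threshold: float = 1.5) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans where a signal is strong relative to its mean."""
    if not samples:
        return []
    mean = sum(samples) / len(samples)
    cutoff = mean * rel_threshold
    spans, start = [], None
    for i, value in enumerate(samples):
        if value >= cutoff and start is None:
            start = i                       # span of high intensity begins
        elif value < cutoff and start is not None:
            spans.append((start / sample_rate_hz, i / sample_rate_hz))
            start = None                    # span ends
    if start is not None:                   # signal stayed high until the end
        spans.append((start / sample_rate_hz, len(samples) / sample_rate_hz))
    return spans
```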

[0034] In STEP 210, a weighting is associated with at least one of the signals. For example, only motion and facial recognition might be considered important for a particular video, so only those signals are given a non-zero weighting. In another instance, explicit signals are not included in the weighting. The weighting can be personal to a particular user, general based on other users, or a combination of both. More specifically, the weighting can be determined based on historical attributes associated with a media content editor (e.g., the user of the application 115, another individual that is recognized for creating popular media content, or other person or entity) and/or historical attributes associated with a group of users (e.g., users who have created media content with other application instances, users who have expressed interest in media content created by the application user, and/or other group of users whose actions can contribute to a determination of the importance of a particular signal relative to other signals).

[0035] For example, if a user creates skydiving videos and frequently indicates that portions containing a high signal intensity for sound are the most interesting to him (e.g., by sharing videos that often contain such high-signal-intensity portions), the system can allocate a higher weighting to the sound signal relative to other signals (e.g., sound is weighted at 60%, while the remainder of the signals make up the remaining 40% weighting). This weighting can be applied to other videos edited by the user and, in some instances, can be combined with weightings based on the preferences of user groups, as indicated above. If combined, individual and group weightings can be weighted equally (e.g., as an initial default weighting), or in other instances, one type can have a greater weight than the other. For example, if there is little or no training data available for a particular individual, the weightings based on user group preferences can be weighted more heavily. In some implementations, signal weighting is also dependent on the context or other attribute(s) associated with particular media content. For instance, if the user prefers high intensity sound signals in his skydiving videos, but prefers high intensity motion signals in his snowboarding videos, the system can weight the signals differently based on whether the user is editing one type of video or the other.
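
By way of illustration, combining individual and group weightings as described above could be sketched as follows; the function name blend_weights, the dictionary layout, and the 50/50 default share are assumptions made only for the example.

```python
from typing import Dict


def blend_weights(individual: Dict[str, float],
                  group: Dict[str, float],
                  individual_share: float = 0.5) -> Dict[str, float]:
    """Blend per-signal weights from an individual and a group, then renormalize.

    With little or no training data for the individual, a caller might lower
    `individual_share` so that group preferences dominate.
    """
    kinds = set(individual) | set(group)
    blended = {
        k: individual_share * individual.get(k, 0.0)
           + (1.0 - individual_share) * group.get(k, 0.0)
        for k in kinds
    }
    total = sum(blended.values()) or 1.0
    return {k: v / total for k, v in blended.items()}


# e.g. a skydiving editor who favors sound, blended with a group that favors faces:
# blend_weights({"sound": 0.6, "motion": 0.4}, {"sound": 0.3, "faces": 0.7})
```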

[0036] Historical attributes of a content editor and/or group of users can include the following: a propensity of the editor/group to favor media content having a particular signal (e.g., sound is preferred over motion, recognized faces, etc.), a propensity of the editor/group to favor media content lacking a particular signal (e.g., a video without recognized faces is preferred), a propensity of the editor/group to favor media content having a particular signal with a particular signal intensity (e.g., a high intensity of motion is preferred in an action-oriented video), a propensity of the editor/group to disfavor media content having a particular signal (e.g., portions of a video in which an ex-girlfriend's face appears are disfavored), a propensity of the editor/group to disfavor media content lacking a particular signal (e.g., video without user-tagged portions is disfavored), and a propensity of the editor/group to disfavor media content having a particular signal with a particular signal intensity (e.g., portions of a concert recording with a low intensity sound signal are disfavored).

[0037] The system can refine the weightings it applies to particular signals as data is collected over time relating to user and group preferences of the signals and signal intensities. In some implementations, the weighting process is facilitated or automatically performed using machine learning, pattern recognition, data mining, statistical correlation, support vector machines, Gaussian mixture models, and/or other suitable known techniques. In one example, signal attributes associated with particular weightings can be viewed as vectors in a multidimensional space, and the similarity between signal attributes of unweighted signals and signals with particular weightings (e.g., weightings that reflect preferred or otherwise popular media portions by the user and/or other users) can be determined based on a cosine angle between vectors or other suitable method. If the similarity meets a threshold, an unweighted signal can be assigned the weighting of the similar signal vector.
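
The cosine comparison described above could be sketched as follows; the 0.9 similarity threshold and the function names are illustrative assumptions, not values taken from the disclosure.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two signal-attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_weighting(unweighted_vec: Sequence[float],
                     weighted_examples: List[Tuple[Sequence[float], Dict[str, float]]],
                     threshold: float = 0.9) -> Optional[Dict[str, float]]:
    """Give an unweighted signal the weighting of its most similar known vector,
    provided the similarity meets the threshold; otherwise return None."""
    best_sim, best_weights = 0.0, None
    for vec, weights in weighted_examples:
        sim = cosine_similarity(unweighted_vec, vec)
        if sim > best_sim:
            best_sim, best_weights = sim, weights
    return best_weights if best_sim >= threshold else None
```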

[0038] As another example, a classifier (e.g., a suitable algorithm that categorizes new observations) can be trained over time using various historical data, such as the historical attributes referred to above. A classifier can be personal to an individual user, and use training data based only on that user's signal preferences and other data. Other classifiers can be trained based on data associated with the preferences of a group of users. For instance, each time an editor shares media content or otherwise indicates that a portion of the media content is of interest, the signal information associated with the (portion of the) media content (e.g., signal preference, signal intensity preference, etc.) can be stored on the user device 110 and/or transferred to remote server 120 for use as training data to improve future weightings for the editor and/or groups of users. The input to such a classifier (e.g., upon creating new media content or opening a media file) can include signal data, intensity data, media content attribute data, and other information associated with the media content. The classifier can then determine, based on the input and the training data, an appropriate weighting of signals for the media content.
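
A toy classifier of this kind could be trained on per-portion signal intensities as sketched below. Scikit-learn and logistic regression are arbitrary choices for the example (the disclosure does not name a library or model), and the feature values and labels are made-up illustrative data.

```python
# pip install scikit-learn
from sklearn.linear_model import LogisticRegression

# Each row: [motion, sound, faces] intensity for a portion the editor acted on.
X_train = [
    [0.9, 0.2, 0.1],   # high-motion portion
    [0.1, 0.8, 0.0],   # loud portion
    [0.2, 0.1, 0.9],   # many faces
    [0.1, 0.1, 0.1],   # quiet, static portion
]
y_train = [1, 1, 1, 0]  # 1 = shared / marked interesting, 0 = ignored

clf = LogisticRegression().fit(X_train, y_train)

# Probability that a new portion is interesting, given its signal intensities.
print(clf.predict_proba([[0.7, 0.6, 0.2]])[0][1])
```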

[0039] Still referring to FIG. 2, in STEP 214, one or more overall portions of interest of the media content are identified based on the intermediate portion(s) of interest and the signal weighting(s). An overall portion of interest can be identified by combining the signals according to their respective weightings, and selecting a portion of the combined signals (corresponding to a portion of the media content) that meets a threshold signal intensity. Alternatively or in addition, the top or bottom N combined signal intensity points (e.g., top/bottom one, top/bottom three, top/bottom five, top/bottom ten, etc.) can be used to determine the overall points of interest. For example, the top three points (in non-overlapping regions) can be identified, and the segments of the media content that surround each point (e.g., +/- N seconds on either side) can be considered overall portions of interest. To illustrate, when a user creates or opens a video via the application 115, the application 115 can suggest one or more portions of the video that might be of interest to the user (e.g., by a suitable form of visual indication), based on signals in the video and weightings determined based on the user, another user, and/or groups of users. In one implementation, the application 115 presents different signal weightings (and, in some cases, the corresponding portions of interest) to the user (e.g., a weighting based on the user's preferences, a weighting based on an expert's preferences, and/or a weighting based on a group of users' preferences) and allows the user to select which weighting(s) and/or portions of interest the user prefers.
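
A rough sketch of STEP 214 under stated assumptions (signals already aligned and sampled at the same rate; top-N selection of non-overlapping points, each padded by +/- a fixed margin) follows; the function names, N=3 default, and 5-second margin are illustrative choices.

```python
from typing import Dict, List, Sequence, Tuple


def combined_intensity(signals: Dict[str, Sequence[float]],
                       weights: Dict[str, float]) -> List[float]:
    """Weighted sum of aligned per-sample intensities across all signals."""
    length = min(len(s) for s in signals.values())
    return [
        sum(weights.get(name, 0.0) * sig[i] for name, sig in signals.items())
        for i in range(length)
    ]


def top_portions(combined: Sequence[float],
                 sample_rate_hz: float,
                 n: int = 3,
                 margin_s: float = 5.0) -> List[Tuple[float, float]]:
    """Pick the N strongest non-overlapping points and pad each by +/- margin_s."""
    order = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    chosen, portions = [], []
    for i in order:
        t = i / sample_rate_hz
        if all(abs(t - c) > 2 * margin_s for c in chosen):  # keep regions disjoint
            chosen.append(t)
            portions.append((max(0.0, t - margin_s), t + margin_s))
        if len(chosen) == n:
            break
    return portions
```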

[0040] In some implementations, the media content being viewed or edited by a user is presented in three-dimensional space, e.g., through a virtual reality or other system having similar functionality. Such systems allow a user to experience up to the full 360 degrees of an environment. These videos often allow a user to immerse themselves in a 360-degree viewable representation of a space, in some cases not navigable, in some cases navigable via orientation, and in other cases navigable via orientation and position. Portions of interest of such immersive videos can be defined temporally as well as spatially, and some or all individual frames, keyframes, and/or segments of a video can be associated with a preferred focus point (e.g., defined by yaw, pitch, and roll) that defines where a user's field of view should be directed when viewing the video. As described above, a portion of interest of a video can be determined based on an analysis of signals and their corresponding intensities at each point in time. Similarly, for a 360-degree video, spatial areas of interest in the video at a particular point in time can also be identified using such signals. Thus, in STEP 218, during presentation of the video, a user's field of view can be guided to the interesting spatial area in the 360-degree video at each point or range of time where a portion of interest exists. Of note, the identification of areas of interest can be performed in real time. Thus, in the case of a live video stream, the user's field of view can be directed to the spatial area of an interesting event that is currently occurring in the video.
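
One way STEP 218 could be realized is to nudge the viewer's orientation toward the focus point a few degrees per frame rather than snapping it instantly; the sketch below assumes yaw-only guidance and a hypothetical per-frame rotation cap, neither of which is mandated by the disclosure.

```python
def shortest_yaw_delta(current: float, target: float) -> float:
    """Signed smallest rotation (degrees) from the current yaw to the target yaw."""
    return (target - current + 180.0) % 360.0 - 180.0


def step_view(current_yaw: float, target_yaw: float, max_step_deg: float = 4.0) -> float:
    """Advance the viewer's yaw one frame toward the focus point, capped per frame."""
    delta = shortest_yaw_delta(current_yaw, target_yaw)
    step = max(-max_step_deg, min(max_step_deg, delta))
    return (current_yaw + step) % 360.0


# e.g. a viewer at 180 degrees being guided toward a skater at 90 degrees:
# yaw = 180.0
# while abs(shortest_yaw_delta(yaw, 90.0)) > 0.5:
#     yaw = step_view(yaw, 90.0)
```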

[0041] In one illustrative example, a 360-degree video captured on a Venice Beach boardwalk includes a skater moving at a location at 90 degrees (yaw). If the user is viewing the video at 180 degrees (yaw), they will miss the interesting action based on the limited human and display system fields of view. Using signals such as motion, human face recognition, and sound, the present system can recognize that the skater is the interesting part of the 360-degree view for a certain period of time and, in playback, the user's field of view can be automatically rotated to ensure that the skater remains within view. In other instances, the user is given some control over how or whether his field of view is changed. For example, he can be notified of a portion of interest and guided toward it while manually changing his field of view. As another example, the user can "clutch" the view to speed, slow, or stop the movement of the field of view, whether it is being moved manually or automatically.

[0042] In one implementation, multiple spatial points or areas of interest are tracked for a particular point or period of time. The spatial areas of interest can be ranked based on respective likelihoods that each area is interesting. During playback of the video, a user's view can be directed to the highest ranking spatial area of interest. In some implementations, the user is permitted to toggle among all of the multiple spatial areas of interest or a subset thereof (e.g., the highest three ranking areas). Spatial points of interest can also be normalized over a particular period of time. For example, if an interesting event occurs within a particular spatial area and the preferred spatial focus point varies within that area from frame to frame, a single preferred spatial focus point can be selected for that period of time in order to steady the user's field of view and avoid noticeable jitter. Various techniques for determining how to select a spatial point of interest, how to identify a spatial area in which spatial points of interest should be normalized, and how to select a time period for normalization are contemplated.
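
A small sketch of the ranking and normalization just described, assuming yaw-only focus points, a circular mean for normalization, and per-candidate interest probabilities; these are illustrative simplifications, not the only contemplated techniques.

```python
import math
from typing import List, Sequence, Tuple


def normalize_yaws(yaws_deg: Sequence[float]) -> float:
    """Collapse per-frame focus yaws over a period into one steady yaw (circular mean),
    which avoids jitter when the focus point drifts slightly from frame to frame."""
    x = sum(math.cos(math.radians(y)) for y in yaws_deg)
    z = sum(math.sin(math.radians(y)) for y in yaws_deg)
    return math.degrees(math.atan2(z, x)) % 360.0


def best_focus(candidates: List[Tuple[float, float]]) -> float:
    """Pick the focus yaw with the highest estimated probability of being interesting.

    `candidates` is a list of (yaw_deg, probability) pairs; ranking them also
    supports letting the user toggle among, e.g., the top three areas.
    """
    return max(candidates, key=lambda c: c[1])[0]
```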

[0043] Based on identified portions of interest, a 360-degree video can also be transformed into a video of reduced size or dimension. For example, a 360-degree video can be reformed into a two-dimensional video presentation that tracks the automatic or manual changes in field of view that would occur to view the spatial areas of interest during playback of the 360-degree video. This allows viewers with less bandwidth or storage, or an inability to view 360-degree videos, to experience the interesting portions of such a video.
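
Assuming equirectangular source frames, the horizontal crop for such a reduced two-dimensional presentation might be computed per frame as sketched below; the 90-degree default field of view and the wrap-around handling are illustrative choices.

```python
from typing import List, Tuple


def crop_window(frame_width_px: int,
                focus_yaw_deg: float,
                fov_deg: float = 90.0) -> List[Tuple[int, int]]:
    """Horizontal pixel range(s) of an equirectangular frame covering the field of
    view centered on the focus yaw; returns two ranges if the crop wraps the seam."""
    px_per_deg = frame_width_px / 360.0
    center = (focus_yaw_deg % 360.0) * px_per_deg
    half = (fov_deg / 2.0) * px_per_deg
    left, right = center - half, center + half
    if left < 0:
        return [(int(left % frame_width_px), frame_width_px), (0, int(right))]
    if right > frame_width_px:
        return [(int(left), frame_width_px), (0, int(right % frame_width_px))]
    return [(int(left), int(right))]


# e.g. crop_window(3840, focus_yaw_deg=90.0) -> [(480, 1440)]
```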

[0044] The terms and expressions employed herein are used as terms and expressions of description and not of limitation, and there is no intention, in the use of such terms and expressions, of excluding any equivalents of the features shown and described or portions thereof. In addition, having described certain implementations in the present disclosure, it will be apparent to those of ordinary skill in the art that other implementations incorporating the concepts disclosed herein can be used without departing from the spirit and scope of the invention. The features and functions of the various implementations can be arranged in various combinations and permutations, and all are considered to be within the scope of the disclosed invention. Accordingly, the described implementations are to be considered in all respects as illustrative and not restrictive. The configurations, materials, and dimensions described herein are also intended as illustrative and in no way limiting. Similarly, although physical explanations have been provided for explanatory purposes, there is no intent to be bound by any particular theory or mechanism, or to limit the claims in accordance therewith.