Title:
IMAGE COLLECTION FEATURE IDENTIFICATION AND FILTERING
Document Type and Number:
WIPO Patent Application WO/2022/224016
Kind Code:
A1
Abstract:
A method and system are provided for processing images to group the images according to scenes, to detect objects such as faces within the grouped images, and to score or rank the images with respect to the detected objects and the visual attributes of each object. The ranking is used to discard undesirable images. The system uses a convolutional neural network and provides an interface for reviewing aspects of the processed images.

Inventors:
BROADBENT JAMES ZACHARY WILLIAM (NZ)
LEVET STEFFAN GREGORY (NZ)
COXON LIAM JAMES ARTHUR (NZ)
SARTEN NICHOLAS JAMES (NZ)
SEIDENBERG JULIAN MALIK (NZ)
WANG YIJUN (NZ)
DURLING MALCOLM THOMAS (NZ)
KAPLAN JOSHUA ADAM (NZ)
FISHER ROBERT MANDENO (NZ)
SCIVALLY SCHELL CARL (NZ)
Application Number:
PCT/IB2021/053311
Publication Date:
October 27, 2022
Filing Date:
April 21, 2021
Assignee:
SOFTWARE LTD (NZ)
International Classes:
G06K9/62
Foreign References:
US20170046561A12017-02-16
US10579897B22020-03-03
US10565472B22020-02-18
US20060280427A12006-12-14
Other References:
ANONYMOUS: "[Select] How to use Narrative Select | Narrative Help Center", INTERNET ARCHIVE WAYBACK MACHINE, 27 February 2021 (2021-02-27), pages 1 - 13, XP093001111, Retrieved from the Internet [retrieved on 20221123]
KUZOVKIN DMITRY: "Assessment of photos in albums based on aesthetics and context", THESIS, 1 January 2019 (2019-01-01), pages v-xi, 1 - 171, XP093001114, Retrieved from the Internet [retrieved on 20221123]
Attorney, Agent or Firm:
AJ PARK (NZ)
Claims:
CLAIMS

1. A method for processing image data using a computer system, comprising: receiving image data for each of a plurality of images; processing the image data for each image to detect one or more objects in each image and one or more attributes of each object; generating for each image an object score for position or size of the object; generating for each image an attribute score for one or more visual attributes of the object; generating for each image a combined score utilising the object score and the attribute score; rejecting one or more images with the lowest combined score.

2. A method as claimed in claim 1 wherein the objects belong to a class of objects such as faces.

3. A method as claimed in claim 1 or claim 2 wherein the one or more attributes comprises one or more of an eye state, an expression state, a focus state.

4. A method as claimed in any one of the preceding claims wherein the eye state comprises one or more of eye or eyes: closed; covered; obscured by glasses; mid blink; open; partially open.

5. A method as claimed in any one of the preceding claims wherein the expression state comprises one or more of: closed smile; frowning; neutral closed; open smile; open wide smile; relaxed open; talking; tight; tongue out.

6. A method as claimed in any one of the preceding claims wherein the focus state comprises one or more of: in focus; partially in focus; partially out of focus; out of focus.

7. A method as claimed in any one of the preceding claims wherein the object score comprises a relative position of the object in the image and/or a relative size of the object in the image and/or a relative size of the object to other detected objects in the image.

8. A method as claimed in any one of the preceding claims wherein the scores are statistically processed to determine which image(s) to reject.

9. A method for processing image data using a computer system, comprising: receiving image data for an image; processing the image data to detect at least one object and a plurality of attributes or features in the or each object; generating a feature score for each of the plurality of features within the or each object; annotating the image with indicia to identify the object and to represent the feature score(s) for one or more of the features.

10. A method for processing image data using a computer system, comprising: receiving image data for an image; processing the image data to detect a plurality of objects and a plurality of features in each object; generating a feature score for each of the plurality of features within the or each object; annotating the image with indicia to identify the object and to represent the feature score(s) for one or more of the features.

11. A method for displaying a selected image comprising: retrieving a low-resolution representation of the selected image; determining a full resolution of the selected image; scaling the low-resolution image to the full resolution; rendering the scaled low-resolution image; and exchanging the rendered scaled low-resolution image for the selected image.

12. The method of claim 11 further comprising generating the selected image while the low-resolution image is being rendered.

13. A method for processing an image using a computer system to detect an object in an image and one or more features in the object, comprising: receiving image data for an image; convolving the image data to produce a plurality of convolution blocks, each block comprising a plurality of approximated feature maps; downsampling and upsampling convolution blocks to generate parallel hierarchical streams of blocks with feature maps of different resolutions; providing lateral and hierarchical cross connections between blocks in different streams; providing output from one or more convolution blocks to one or more detectors to determine one or more objects and/or attributes of each object in the image.

14. The method of claim 13 further comprising providing training data.

15. A method for processing a plurality of images using a computer system to determine which images within the plurality of images are similar to each other, the method comprising: receiving image data for each image; convolving each image to obtain image vector data; comparing image vector data of one or more neighbouring images either in image capture time or image capture sequence.

16. A method for processing a plurality of images using a computer system to determine which images within the plurality of images are similar to each other, the method comprising: receiving image data for each image; training a VAE using the plurality of images by comparing one or more reconstructions of at least one of the images; using the VAE encoders to provide image vector data; comparing the image vector data.

17. A CNN configured to perform any one or more of the foregoing methods.

18. A computer device or network configured or programmed to perform any one or more of the foregoing methods.

19. Any novel feature or combination of features disclosed herein.

Description:
IMAGE COLLECTION FEATURE IDENTIFICATION AND FILTERING

FIELD

This disclosure relates generally to methods and systems for detecting images that lack desirability, so that undesirable images may be automatically filtered out of an image set or collection of images. The disclosure has relevance to use of machine learning for semantic image grouping, image feature identification, and ranking.

BACKGROUND

With the growth of digital technology and in particular digital photography, there is an ever-larger number of photographs being captured. These are simply stored by some users, but for some, such as professional photographers, the images that have been acquired must be sorted to identify and remove those that are perceived to lack desirability or quality.

One pertinent example is wedding photography. A photographer on an eight-hour wedding shoot may take 2000 photos, but then need to select, edit and deliver perhaps 400 photos to the bride and groom. The time and effort required to view such a large number of photographs is daunting and can be very inefficient. Whether the task is performed by a professional or otherwise, each image needs to be viewed in some detail to determine its desirability.

One issue with this task is simply reviewing the overall composition, but another is detecting undesirable features which are present in more detailed parts of the images. For example, it may not be apparent on a cursory review that the eyes of a key member of the wedding party are closed in an otherwise highly desirable image. For a photographer or similar user to review an image in detail usually requires the image to be expanded or enlarged to some degree. This becomes problematic when working with high resolution images since many have an embedded lower resolution image that is used for selection purposes. Therefore, enlarging the low resolution image tends not to help improve viewability of the details of the image. The solution is to load the raw or native high resolution image, but this takes time to load, making the whole process very time consuming.

Even for the amateur photographer, the ease of use of digital cameras (including smart phones) has led to an increase in the number of digital images being captured. The camera devices have large storage capacity, but at some point, whether on the device or in a remote repository such as cloud storage, capacity is reached. The user then has the task of choosing which images to dispose of in order to maintain a library of a reasonable or permitted size. As with the example above, the sorting process can be time consuming, especially if detail in the images is reviewed as part of the sorting or filtering process.

There is a need to alleviate the above problems. It is an object of the present disclosure to provide a method or system which overcomes or ameliorates one or more of the disadvantages of existing solutions, or which at least provides a useful alternative.

SUMMARY

One or more embodiments described below provide benefits and/or solve one or more of the foregoing or other problems in the art with systems, methods, and computer readable media that automatically filter, rank or score images dependent on objectively determined image characteristics.

In this manner, the disclosed systems, methods, and computer readable media intelligently and automatically filter images in a collection of images, thereby eliminating the need for users to manually select images, and the time and effort associated therewith.

In one aspect a method is provided for processing image data using a computer system, comprising: receiving image data for each of a plurality of images; processing the image data for each image to detect one or more objects in each image and one or more attributes of each object; generating for each image an object score for position or size of the object; generating for each image an attribute score for one or more visual attributes of the object; generating for each image a combined score utilising the object score and the attribute score; rejecting one or more images with the lowest combined score.

In one embodiment the objects belong to a class of objects such as faces.

In one embodiment the one or more attributes comprises one or more of an eye state, an expression state, a focus state.

In one embodiment the eye state comprises one or more of eye or eyes: closed; covered; obscured by glasses; mid blink; open; partially open.

In one embodiment the expression state comprises one or more of: closed smile; frowning; neutral closed; open smile; open wide smile; relaxed open; talking; tight; tongue out.

In one embodiment the focus state comprises one or more of: in focus; partially in focus; partially out of focus; out of focus.

In one embodiment the object score comprises a relative position of the object in the image and/or a relative size of the object in the image and/or a relative size of the object to other detected objects in the image.

In one embodiment the scores are statistically processed to determine which image(s) to reject.

In another aspect a method is provided for processing image data using a computer system, comprising: receiving image data for an image; processing the image data to detect at least one object and a plurality of attributes or features in the or each object; generating a feature score for each of the plurality of features within the or each object; annotating the image with indicia to identify the object and to represent the feature score(s) for one or more of the features.

In another aspect a method is provided for processing image data using a computer system, comprising: receiving image data for an image; processing the image data to detect a plurality of objects and a plurality of features in each object; generating a feature score for each of the plurality of features within the or each object; annotating the image with indicia to identify the object and to represent the feature score(s) for one or more of the features.

In another aspect a method is provided for displaying a selected image comprising: retrieving a low-resolution representation of the selected image; determining a full resolution of the selected image; scaling the low-resolution image to the full resolution; rendering the scaled low-resolution image; and exchanging the rendered scaled low-resolution image for the selected image.

In one embodiment the method further comprises generating the selected image while the low-resolution image is being rendered.

In another aspect a method is provided for processing an image using a computer system to detect an object in an image and one or more features in the object, comprising: receiving image data for an image; convolving the image data to produce a plurality of convolution blocks, each block comprising a plurality of approximated feature maps; downsampling and upsampling convolution blocks to generate parallel hierarchical streams of blocks with feature maps of different resolutions; providing lateral and hierarchical cross connections between blocks in different streams; providing output from one or more convolution blocks to one or more detectors to determine one or more objects and/or attributes of each object in the image.

In one embodiment the method further comprises providing training data.

In another aspect a method is provided for processing a plurality of images using a computer system to determine which images within the plurality of images are similar to each other, the method comprising: receiving image data for each image; convolving each image to obtain image vector data; comparing image vector data of one or more neighbouring images either in image capture time or image capture sequence.

In another aspect a method is provided for processing a plurality of images using a computer system to determine which images within the plurality of images are similar to each other, the method comprising: receiving image data for each image; training a VAE using the plurality of images by comparing one or more reconstructions of at least one of the images; using the VAE encoders to provide image vector data; comparing the image vector data.

In another aspect a CNN is provided configured to perform any one or more of the foregoing methods.

In another aspect a computer device or network is provided configured or programmed to perform any one or more of the foregoing methods. Additional features and advantages of exemplary embodiments of the present disclosure will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such exemplary embodiments.

Other aspects of the invention may become apparent from the following description which is given by way of example only and with reference to the accompanying drawings.

In this specification where reference has been made to patent specifications, other external documents, or other sources of information, this is generally for the purpose of providing a context for discussing the features of the invention. Unless specifically stated otherwise, a reference to such external documents is not to be construed as an admission that such documents, or such sources of information, in any jurisdiction, are prior art, or form part of the common general knowledge in the art.

It is also to be understood that the specific devices illustrated in the attached drawings and described in the following description are simply exemplary embodiments of the invention. Hence, specific dimensions and other physical characteristics related to the embodiments disclosed herein are not to be considered as limiting.

It is acknowledged that the term "comprise" may, under varying jurisdictions, be attributed with either an exclusive or an inclusive meaning.

For the purpose of this specification, and unless otherwise noted, the term 'comprise' shall have an inclusive meaning, allowing for inclusion of not only the listed components or elements, but also other non-specified components or elements. The terms 'comprises' or 'comprised' or 'comprising' have a similar meaning when used in relation to the system or to one or more steps in a method or process.

As used hereinbefore and hereinafter, the term "and/or" means "and" or "or", or both. As used hereinbefore and hereinafter, "(s)" following a noun means the plural and/or singular forms of the noun.

When used in the claims and unless stated otherwise, the word 'for' is to be interpreted to mean only 'suitable for', and not for example, specifically 'adapted' or 'configured' for the purpose that is stated.

For the purpose of this specification, where method steps are described in sequence, the sequence does not necessarily mean that the steps are to be chronologically ordered in that sequence, unless there is no other logical manner of interpreting the sequence.

The entire disclosures of all applications, patents and publications, cited above and below, if any, are hereby incorporated by reference.

BRIEF DESCRIPTION OF THE DRAWINGS

Various embodiments will be described and explained with additional specificity and detail with reference to the accompanying drawings in which:

FIG. 1 illustrates a system overview diagram in accordance with one or more embodiments;

FIG. 2 illustrates a tiled collection of zoomed objects derived from the photograph of Figure 4 in accordance with one or more embodiments;

FIG. 3 illustrates a zoomed portion of the photograph of Figure 4 in accordance with one or more embodiments;

FIG 3A illustrates a diagram showing an example of annotation in connection with an image in accordance with one or more embodiments;

FIG. 4 illustrates an annotated photographic image in accordance with one or more embodiments;

FIG. 5 illustrates a series of images in accordance with one or more embodiments;

FIG. 6 illustrates a diagram of a process in accordance with one or more embodiments;

FIG. 7 illustrates a simplified diagram of a convolutional neural network;

FIG. 8 illustrates a diagram of use of one or more variational auto-encoders in accordance with one or more embodiments;

FIG. 9 illustrates a diagram illustrating use of a convolutional encoder-decoder;

FIG. 10 illustrates another diagram of a convolutional encoder-decoder;

FIG. 10A and 10B illustrate training and use respectively of a convolutional encoder-decoder in accordance with one or more embodiments;

FIG. 11A and 11B illustrate a method for comparing image vectors to determine scenes in accordance with one or more embodiments;

FIG. 12 illustrates a sequence of images comprising at least two scenes in accordance with one or more embodiments;

FIG. 13 illustrates a graph;

FIG. 14 illustrates a process, method or system for scoring and/or selecting images in accordance with one or more embodiments;

FIG. 15 illustrates a flow chart for scoring and/or selecting images in accordance with one or more embodiments;

FIG. 16 illustrates a CNN in accordance with one or more embodiments;

FIG. 17 illustrates training images in accordance with one or more embodiments;

FIG. 18 illustrates a diagram of a network environment in accordance with one or more embodiments; and

FIG. 19 illustrates a diagram of an example of a computing device in accordance with one or more embodiments.

DETAILED DESCRIPTION

One or more embodiments of the present disclosure include an image selection system that automatically filters or removes one or more images from a plurality of images based on one or more automatically identified features. The automatic detection uses a ranking technique that detects and identifies one or more undesirable aspects or features of the one or more images.

Referring to Figure 1, an overview of a system 100 according to at least one embodiment is shown diagrammatically. As will be described further below, the system, which embodies one or more processes, is in practice implemented on a computer system comprising at least one computing device executing instructions in memory, provided as software on a non-transitory computer readable medium, to enable the system or device to perform the processes described herein.

The system 100 takes a collection of images 102-106 and processes the images to detect objects in the images and to detect visual attributes of those objects, as will be described further below. In some embodiments system 100 uses a semantic image grouping algorithm 108 to group the images 102-106 into semantically related groups of the same scene and thus provide a scene grouping 110. It will be understood that the three images shown in Figure 1 are merely representative of a plurality of images. In practical examples there may be tens, hundreds or thousands of images that are input to the system 100 to be processed. It will also be understood that in other embodiments the input images may have already been grouped satisfactorily so that algorithm 108 does not have to be used.

A "scene" is defined in this disclosure as a collection of similar images, for example a collection of similar photographs. In practice, a scene typically comprises a collection of similar photographs of which a photographer would only pick some, for example one or two, to keep out of a larger collection of photographs. The output of the semantic image grouping algorithm 108 is a scene grouping 110 of the input images that can be displayed in a software application on an appropriate display for viewing by a user in a number of ways, including for example without limitation: as a stack of images, a set of images visually clustered together, or some indicia such as a visual identifier, for example a bar or line connecting related images together.

The system 100 then takes all the images in the scene grouping 110 and uses a feature detection machine learning algorithm to detect one or more defined features present in the image. This occurs at step 112 of Figure 1. Feature detection can encompass a wide variety of different possible features that may be present in an image. In this specification the features to be detected comprise certain objects, and visual attributes of or within those objects. The system 100 is trained to detect the objects and related attributes as will be described further below. In some embodiments the objects for detection comprise faces, and the visual attributes comprise, without limitation, eye state, expression state and focus state. Therefore, some embodiments use an algorithm to find all the faces within the or each image. Other embodiments find other objects in an image, for example all the animals, or all the vehicles, etc. In many embodiments the common pattern is that the algorithm employed finds an object that is of interest to the user, for example the photographer.

In an embodiment a user interface for the application displays the object of interest to the photographer in a close-up panel that displays tiles of zoomed-in close-ups of each of the items of interest. This is identified as step 114 in Figure 1. The application or interface also allows the user to select (for example by pressing a button on the interface) and zoom into the object. The zooming function gives the user or photographer greater context, while the tiling panel of all the objects in the image (or for some or all of the images comprising a scene) gives the photographer an overview. This is shown in Figures 2 and 3, which are discussed further below.

Figure 3 portrays a zoomed image 300 of an original photographic image 400 (shown in Figure 4) including six people. The zoomed image 300 of Figure 3 is zoomed into the object of greatest interest, being the central face in the original photograph. The application has visually identified the face of interest with bounding box 302 and annotated it with further signifiers 304 and 306 in Figure 3. Figure 2 shows the detected objects, comprising all the faces in image 400 linked for ease of review. In this example the faces are arranged as tiles 202-212, each tile representing and portraying one of the faces in the original photographic image 400.

The system 100 takes each object or item of interest (in this example each face) as identified from the previous step and runs a machine learning algorithm to find additional details about the item. This is shown as step 116 in Figure 1. For a face, additional details might include the state of the eyes (open, closed, half-open, obscured, blinking, etc), the state of the mouth (smiling, frowning, laughing, surprised, etc.), and the focus of the face (completely in focus, completely out of focus, etc.). Objects that are not faces would have other details that the algorithm finds. The algorithm might for example evaluate a wedding ring for focus, highlights, shininess, cut, angle, and color.

It will be seen that there is annotation, or surfacing, of the images which identifies the faces and provides a summary of one or more of the visual attributes of each face. This annotation is generally consistent across Figures 2-4. As will be described further below, the indicia beneath each face provide clear visual identifiers or annotations for a user regarding likely desirable and/or undesirable aspects of the features of interest (in this embodiment faces) shown in each tile. The tiling panel allows the features of interest to be clearly displayed adjacent to each other and in a manner in which they can be appropriately scaled relative to each other for ease of comparison.

For purposes of ease of description, the remainder of the embodiments herein will be described using the example of human faces as the feature of interest in the images. As described above, those skilled in the art will appreciate that other features may instead (or in addition) be identified as features of interest.

An output of the machine learning algorithm, provided to allow a user to quickly identify and find additional details, is a set of annotations (214-222 in Figure 2) on or adjacent to the item of interest with objective details. This is shown as step 118 of Figure 1. The annotations allow a user such as a photographer to discover, at a glance, the state of an objective detail, without having to zoom into an item and carefully inspect it. This saves the photographer time when comparing multiple images of the same object or scene.

As shown in Figure 3A, the annotation 304 in some embodiments comprises a line which can be coloured to provide a very quick visual summary to a user of desirable or undesirable aspects of the objects detected in the image. Indicia 304 may in some embodiments be red if the system 100 has determined that there is a low score or ranking for the object or attributes of the object. Indicia 304 may in some embodiments indicate a score or ranking for the eyes of the face to which it relates. Considering the objects, i.e. the faces in Figure 2, it can be seen that face 202 has its eyes closed and face 210 is facing downward so that the eyes cannot be clearly seen. Therefore, annotation indicia 304 is red for those faces. Annotation indicia 304 is also red for face 206 because the eyes are almost shut. Indicia 306 may in some embodiments be red if the system 100 has determined that there is a low score or ranking for the object or attributes of the object. Indicia 306 may in some embodiments indicate a score or ranking for the focus of the face to which it relates. The annotations 304 and 306 convey scoring information but do not interfere with viewing of the images, either in zoomed format or original format. Scoring and ranking algorithms are described later in this document.
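By way of illustration only, the following Python sketch shows one way the per-attribute scores described above might be mapped to coloured annotation indicia such as 304 and 306. The threshold value and colour names are assumptions for the example and are not values taken from the disclosed embodiments.

```python
# Minimal sketch: map per-face attribute scores to annotation colours.
# The 0.5 threshold and the colour names are illustrative assumptions.

def indicia_colour(score: float, low_threshold: float = 0.5) -> str:
    """Return a display colour for an annotation line such as 304 or 306."""
    return "red" if score < low_threshold else "green"

face_scores = {"eyes": 0.12, "focus": 0.91}   # hypothetical scores for one detected face
annotations = {name: indicia_colour(s) for name, s in face_scores.items()}
print(annotations)   # {'eyes': 'red', 'focus': 'green'}
```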

Some objective annotations can be hidden to avoid cluttering the screen with annotations and overwhelming the photographer. In Figure 4 the original image 400 relating to Figures 2 and 3 is shown. Annotations are present, corresponding to those in Figures 2 and 3. The same annotation indicia are used, but at different sizes or scales in order to assist the user with identification, but not obscure the relevant details in the features of interest.

In some embodiments, this may be all the assistance that the photographer wants or requires, since there is readily identifiable guidance across groups of photographs that allows a photographer to quickly make decisions as to which photographs should be discarded and which should be kept.

In other embodiments system 100 provides a ranking or scoring system which can rank or score images based on their desirability, or more correctly, by ranking based on undesirability. This is shown as step 120 in Figure 1. The ranking or scoring feature allows images to be automatically filtered out of the collection of images. This is shown as step 122 in Figure 1. Therefore, system 100 may take the images in a scene grouping, the faces and items of interest and all their objective details and combine this information into an assessment of image desirability. This image desirability rank can then be used to filter out any undesirable images and/or highlight desirable images. By filtering out undesirable images the photographer can avoid having to manually remove those images and save time. By highlighting desirable images the photographer can quickly generate a preview of the best images in a collection and save considerable time.

In some embodiments images that have been filtered out may be identified as such. Thus, for example, Figure 5 shows a series of images 502-506 in which the central image 504 is marked or annotated with a coloured line or segment at 508 to identify that it is discarded or filtered out of the collection. In this example image 504 has been filtered out because the eyes of the face identified in the image are closed. This is not something that is easily identified by a photographer viewing the image thumbnails, or similar low-resolution representations, and has therefore saved the user considerable time and effort.

An example of a process by which system 100 may provide a user interface according to 114 of Figure 1 to zoom to relevant faces in an image and to tile faces in a close-up panel will now be described with reference to Figure 6.

The process begins with the original image 102 being scaled down to a smaller size scaled image 602. The smaller size has the advantage that it allows the Machine Learning (ML) model to run quickly on a user's computing device. A large image would require more processing power and would take longer to run, frustrating the user or requiring specialised, powerful ML computing hardware that is not commonly found in users' personal computers. The scaled image 602 is input to a Neural Network (NN) 604. The NN 604 processes the image to identify the objects of interest and outputs as metadata in step 606 the coordinates of bounding boxes for all the faces (or other objects of interest) in the image. These coordinates are then scaled up to the original image's dimensions and used by an application to crop the original image to retrieve a high-resolution image of each face or item of interest. These resulting high resolution images can then be displayed as desired. As explained above, in one embodiment the images may be arranged as a panel of tiles which is generally represented as 608 in Figure 6, and shown as 202 in the practical example shown in Figure 2.
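The following Python sketch illustrates, under stated assumptions, the flow of Figure 6: the original image is scaled down, a detector (a hypothetical stand-in for NN 604, passed in as a callable) returns bounding boxes on the scaled image, the boxes are scaled back to the original dimensions, and high-resolution crops are produced. It is a sketch only, not the implementation used by system 100.

```python
# Sketch of the Figure 6 flow. `detect_faces` is a hypothetical stand-in for
# NN 604 that returns (x, y, w, h) boxes in scaled-image pixels.
from PIL import Image

def crop_high_res_faces(original_path, detect_faces, working_size=(640, 640)):
    original = Image.open(original_path)
    scaled = original.copy()
    scaled.thumbnail(working_size)                 # small image keeps the ML model fast
    sx = original.width / scaled.width             # scale factors back to full resolution
    sy = original.height / scaled.height
    tiles = []
    for (x, y, w, h) in detect_faces(scaled):      # boxes on the scaled image
        box = (int(x * sx), int(y * sy), int((x + w) * sx), int((y + h) * sy))
        tiles.append(original.crop(box))           # high-resolution close-up tile
    return tiles
```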

In another embodiment, the NN 604 that finds faces or items of interest is trained specifically to distinguish between a) objects such as faces in the background of an image, i.e. items or features that a user such as the photographer does not care about, and b) objects such as faces that are the primary subject of the image. This results in more relevant faces (or items) appearing in the close-up panel and increases the likelihood of zooming to the most important face (or item) in an image. In some embodiments the primary object or objects of interest are detected from their recurrence in a scene. For example, in a wedding scene the bride and groom will occur in the majority of images and system 100 may use the frequency of detection of a same or similar face as an attribute for determining primacy. In other embodiments the detection of the object that is a primary subject may be based on scoring of relative position/location or size of the detected object. Scoring is discussed later in this document.

As set forth above, a user when reviewing images that have been, or are being, processed by system 100 will often want to jump from a preview of the image into a detailed i.e. zoomed part of the original image to view certain detail. Further disclosure of generating an original or zoomed image from a low-resolution preview will now be provided.

In the vast majority of cases professional photographers take photographs in a "raw" image format. A raw image format essentially contains:

• signal data collected by the light sensors inside of the camera

• metadata (ex. rotation, resolution, white point, lens used, etc.)

• a pre-rendered version of the camera's signal data

This pre-rendered version of the camera's signal data is typically referred to as an "embedded JPEG". This is because the JPEG image format is used to represent the version, and it is embedded in the sense that it is located within the raw image file and not stored as its own separate file.

While this does not fundamentally alter this disclosure, those skilled in the art will appreciate that there is no such thing as the raw image format, since there are numerous raw image formats. Most camera manufacturers have their own raw image format, and some have multiple. For example, Sony™ raw images use the file extension arw while Canon™ uses cr2 and cr3. These formats are constantly changing as camera manufacturers release new cameras (and are rarely, if ever, documented).

For some raw image formats (such as those developed by Sony™), the embedded JPEG is quite low resolution, primarily intended to be used on the camera itself when reviewing images. Displaying this embedded JPEG on a much larger computer screen is undesirable as the photographer will be unable to sufficiently see fine details in a photo such as a person's eyes. For other raw image formats (such as those developed by Canon™) the embedded JPEG is typically the same resolution as the underlying raw sensor data and is sufficiently detailed to be used directly.

In some embodiments system 100 can always generate a full resolution version of each raw image instead of using the embedded JPEG. However, in practice this is not necessary and has numerous downsides as doing so uses up computer resources that could instead be put towards other valuable tasks (such as the ML assessments disclosed herein). Furthermore, on a laptop, for example, each additional computation further depletes the battery.

In some embodiments system 100 determines whether to generate a full resolution version from the raw image. This can be done in numerous ways including:

• file extension
o Some manufacturers' (ex. Sony™) formats always have a low resolution embedded JPEG
o This means of determination is extremely efficient as it does not require reading the contents of the actual file

• comparison of metadata to embedded JPEG
o The metadata of the photo includes resolution information for the image represented by the camera's signal data
o The embedded JPEG contains information on its own resolution
o Then these two resolutions can be compared
o This determination can be quite inefficient as it requires reading substantial portions of the file content; in practice these files are often stored on slow external mechanical hard drives

• corresponding JPEG
o Some cameras can optionally capture a corresponding JPEG for each raw image taken
o When this occurs, there is no need to perform the above comparisons as this corresponding JPEG can be used directly
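A minimal Python sketch of the determination strategies listed above is given below. The extension set and the resolution-reading helper functions are illustrative assumptions, not part of the disclosed system.

```python
# Sketch of the resolution-determination heuristics above. The extension set
# and the metadata/preview readers are illustrative assumptions.
import os

LOW_RES_PREVIEW_EXTENSIONS = {".arw"}   # e.g. formats known to embed small previews

def needs_full_generation(raw_path, read_sensor_resolution=None,
                          read_embedded_resolution=None, has_sidecar_jpeg=False):
    if has_sidecar_jpeg:
        return False                      # a corresponding JPEG can be used directly
    ext = os.path.splitext(raw_path)[1].lower()
    if ext in LOW_RES_PREVIEW_EXTENSIONS:
        return True                       # cheap check: no need to read the file contents
    if read_sensor_resolution and read_embedded_resolution:
        sw, sh = read_sensor_resolution(raw_path)      # from the raw metadata
        ew, eh = read_embedded_resolution(raw_path)    # from the embedded JPEG
        return ew * eh < sw * sh          # embedded preview smaller than the sensor data
    return True                           # when in doubt, generate the full image
```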

A large amount of work is required to turn the signal data captured by a camera's light sensors into something displayable on a screen that a human would consider representative of what their eyes would see in the same situation.

To perform this work system 100 uses existing open source technologies for reading raw files (such as Libraw™) along with tone mapping. Tone mapping is a process by which the colors of an image are "tweaked" such that they more closely resemble what the photographer actually saw on the digital viewfinder of their camera at capture time. If this is not done the photo will look "flat" or "washed out".

Generating an image from a camera's signal data can easily take an entire second. This is both because almost the entire content of the raw image file (which is frequently in the range of 25 - 50 megabytes) must be read (often from a slow external mechanical hard drive) as well as because of the computationally expensive operations required to transform the signal data itself.

Because of this, doing this one second of work at the right time is very desirable. This is further complicated by the fact that:

• people in general have little patience when using technology

• professional photographers put a significant premium on fast applications

• there is a large amount of other work that needs to be done by system 100 (such as performing machine learning assessments on the images)

As such, system 100 employs a prioritization system which can constantly reorder the next set of work it will undertake based on the current image being viewed as well as the images proximate to that one.
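A minimal sketch of such a prioritisation, assuming work items are identified by their index in the viewing order, is shown below; the ordering rule is an illustrative assumption rather than the actual scheduling logic of system 100.

```python
# Minimal sketch of prioritising full-resolution generation around the image
# the user is currently viewing. The closest-first rule is an assumption.
def reorder_work(pending_indices, current_index):
    """Images closest to the currently viewed image are generated first."""
    return sorted(pending_indices, key=lambda i: abs(i - current_index))

queue = [0, 1, 2, 3, 4, 5, 6, 7]
print(reorder_work(queue, current_index=5))   # [5, 4, 6, 3, 7, 2, 1, 0]
```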

Despite the best efforts of the prioritization system it can easily be the case that a user will navigate to an image for which no original high-resolution image has yet been generated from the camera's signal data.

The easiest way to handle this is to not show the image to the user until the image generation has completed. Unfortunately, this is a highly unattractive option for many users.

Accordingly, in some embodiments, system 100 shows a user a representative image in the time between the user navigating to (i.e. requesting) the image and the high-resolution image generation completing. The representative image used by system 100 is the embedded JPEG which, while resembling the camera's actual signal data, is much lower resolution.

The fundamental challenge is that the embedded JPEG and the to-be-generated image are not the same as one another. In particular the embedded JPEG is much lower resolution than the to-be-generated image. For example the embedded JPEG might be 3,000 x 4,000 pixels while the to-be-generated image might be 9,000 x 12,000 pixels. The reason this is a problem is that the moment the generated image exists it is desirable to show it to the user. Naively exchanging, i.e. "swapping out", a small image with a much larger one would be highly problematic. For example, if the user was zoomed into the small image at the moment of the exchange, then there will be a very awkward transition. Being zoomed 2x into the small image wouldn't be equivalent to being 2x zoomed into the much larger generated image.

The system 100 can be configured in some embodiments to treat the original smaller image as if it were the size of the to-be-generated high-resolution image. For example, to treat the 3,000 x 4,000-pixel image as if it were 9,000 by 12,000 pixels. This allows a person to zoom into a portion of the image such that it may become highly pixelated, but less than a second later it can be exchanged by system 100 for the full resolution version which will not be highly pixelated.
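A minimal sketch of this behaviour, assuming the Pillow library and treating the embedded JPEG as a placeholder stretched to the full-resolution dimensions so that zoom coordinates remain stable across the exchange, is shown below.

```python
# Sketch: present the embedded JPEG as if it already had the dimensions of the
# to-be-generated image, so a 2x zoom refers to the same region before and
# after the swap. Pillow is used here purely for illustration.
from PIL import Image

def placeholder_at_full_size(embedded_jpeg: Image.Image, full_size):
    # e.g. a 3000x4000 preview stretched to 9000x12000; pixelated, but the
    # viewport coordinates match the full-resolution image that will replace it
    return embedded_jpeg.resize(full_size)
```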

There are multiple ways that a generated image can be persisted such that the high resolution or original image generation step (the approximately one second of work referred to above) does not need to be performed each and every time a user views an image. In various embodiments system 100 performs one or more of the following:

• Creating a corresponding file
o This mimics the behaviour some cameras have of optionally capturing a corresponding JPEG for each raw image taken

• Store the generated image in a cache
o Image data is stored semi-permanently in a location not corresponding to the original image and not directly accessible by a user
o This cache can take many different forms, including those that are written to the computer's disk as well as those that exist purely in memory

Once a generated image has been persisted, as long as it remains persistent it does not need to be generated again.
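A minimal sketch of such a cache, assuming a disk cache keyed by a hash of the original file path and a hypothetical generate_full_image function standing in for the raw-to-displayable conversion, is shown below.

```python
# Minimal sketch of persisting a generated image so the ~1 second of raw
# processing is not repeated. The cache location and key scheme are assumptions.
import hashlib
from pathlib import Path

CACHE_DIR = Path.home() / ".image_cache"

def cache_path_for(raw_path: str) -> Path:
    key = hashlib.sha256(raw_path.encode("utf-8")).hexdigest()
    return CACHE_DIR / f"{key}.jpg"

def load_or_generate(raw_path: str, generate_full_image):
    cached = cache_path_for(raw_path)
    if cached.exists():                       # already persisted: no need to regenerate
        return cached
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    generate_full_image(raw_path, cached)     # hypothetical raw -> displayable conversion
    return cached
```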

An example of the process above is shown in Figure 6A. Beginning at block 650, system 100 determines whether the requested image has already been generated. If yes, then the generated image is shown at block 652. If not, then generation is prioritised at block 654. This begins a parallel stream of image generation at step 656 while the sequence beginning at 658 commences with extraction of a low-resolution version (such as an embedded JPEG) of the requested image. The full resolution of the original image is determined at block 660, following which the JPEG is scaled to the full resolution at 662 and then rendered at 664. Once generated, the image can be exchanged for the scaled JPEG as shown at block 670. As shown in Figure 6A, the generated full resolution image can be persisted at 666, and after checking that the persisted image corresponds to the displayed image (at 668), the exchange can occur.

As described above, one or more embodiments use NNs. Some embodiments use Convolutional Neural Networks. The Convolutional Neural Network (CNN) is a preferred basic structure used in several embodiments for finding the faces in an image and classifying various objective visual attributes of an image or the face(s) in an image. Figure 7 is a diagram that illustrates how a CNN works in a simplified form.

Referring to Figure 7, a CNN 700 takes in the input image 102 which in this example includes a car and transforms the image with several convolutions, represented by two convolution layers 702 and 704. Each convolution reveals features of the image that can be interpreted by the CNN. Pooling layers 706 and 708 aggregate the output of a convolution to prevent the CNN getting too large. Finally, a fully connected layer 710 (also known as Dense Layer or Detection Head) interprets the features and processes them into a classification or prediction 712. The fully connected layer may have multiple hidden layers, and the prediction 712 may comprise multiple different classes of objects which are classified with an associated probability score for example. The CNN 700 is trained with a collection of labelled training images. The training data is used to adjust the weights inside the CNN using a backpropagation algorithm. The backpropagation algorithm adjusts the weights until the output of the CNN can successfully predict the training data. With a sufficiently large set of training data, the resulting structure inside the CNN is sufficiently generalised so that it can successfully classify new images that have not been seen in training.
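For illustration, a minimal sketch of the simplified structure of Figure 7 (two convolution layers, two pooling layers and a fully connected classification head) is given below, written with the PyTorch library as one possible implementation choice; all channel counts and the input size are assumptions for the example.

```python
# Minimal PyTorch sketch of the simplified CNN of Figure 7. Channel counts,
# input size and class count are illustrative assumptions.
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # convolution layer 702
            nn.MaxPool2d(2),                                          # pooling layer 706
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # convolution layer 704
            nn.MaxPool2d(2),                                          # pooling layer 708
        )
        self.classifier = nn.Linear(32 * 56 * 56, num_classes)        # fully connected layer 710

    def forward(self, x):                      # x: (batch, 3, 224, 224)
        x = self.features(x)
        return self.classifier(x.flatten(1))   # class scores, i.e. prediction 712

logits = SimpleCNN()(torch.randn(1, 3, 224, 224))
```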

A scene grouping method or process according to some embodiments will now be described with reference to the CNN concepts introduced above. Referring to Figure 8, a Scene Grouping method 800 groups visually and semantically similar images 102-106 together. One embodiment of this process uses one or more Variational Auto-Encoders (VAEs) 802-806 to do the grouping. The or each VAE is a CNN structure with a series of NN layers as shown diagrammatically as VAE 900 in Figure 9. These are convolved in layers 902 with pooling 904 occurring after each convolution to gradually reduce the number of parameters in each layer until they reach a middle point of minimum size at 922. The network then expands through upsampling 910 and convolution 912 of each upsampled layer to arrive at 912 which comprises an attempt to recreate the original image from the encoded/compressed representation. The network 802 is trained on training images until it can successfully reproduce the training images. Typically, VAE training is unsupervised, meaning that it is trained without the need for labelled data. In one embodiment, the VAE 900 is trained in the conventional way, with unlabelled data.

In another embodiment, the VAE 900 is trained on a series of training images belonging to a Scene Grouping. This is a novel supervised method of training the VAE. The training method is described below.

If, given an image in the scene, the VAE 900 successfully recreates an image that is visually similar to any of the other images in the scene, that is counted as a success. The success is fed back to a backpropagation algorithm and thereby used to train the VAE. This teaches the VAE to encode details relevant to the grouping of scenes when compressing the information in an image and to throw away details that are irrelevant. This can vary depending upon what the user considers should constitute a scene. In the example of photographs taken at a social gathering, consideration of whether images belong to the same scene should place more emphasis on the people in the images. Therefore, for example, changes in the people in a photo are important, while changes in the clouds in the sky are not. A conventional VAE would consider both equally. The VAE training method disclosed herein therefore allows the VAE to distinguish between relevant and irrelevant details when determining what images should constitute a scene among a collection of images. The training is represented in Figure 8 with VAE 900 comprising an encoder 920 which performs the convolution and pooling to produce the image vector 922, and decoder 924 which produces the attempted recreation of the original image. The use of training data is shown at 926.

After training the VAE 900, the Decoder 924 is discarded and only the Encoder is used in the system, comprising the VAEs 802-806 as shown in Figure 8. The Encoder 920 takes an image and encodes it into the compressed representation that captures the details that are relevant to the scene grouping. The compressed representation comprises an Image Vector (IV) consisting of coordinates in a higher dimensional space.
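A minimal PyTorch sketch of an encoder of this kind, which maps an image to a compact Image Vector after the decoder has been discarded, is shown below; the layer sizes and the dimensionality of the vector are assumptions for the example, not the structure of VAE 900.

```python
# Sketch of the kept-after-training part of the VAE: a convolutional encoder
# that maps an image to a compact Image Vector (IV). Sizes are assumptions.
import torch
import torch.nn as nn

class SceneEncoder(nn.Module):
    def __init__(self, vector_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),   # convolution + downsampling
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                               # pool to a single spatial cell
        )
        self.to_vector = nn.Linear(32, vector_dim)                 # the compressed representation

    def forward(self, image):                    # image: (batch, 3, H, W)
        return self.to_vector(self.conv(image).flatten(1))

iv = SceneEncoder()(torch.randn(1, 3, 256, 256))   # one Image Vector per input image
```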

Figures 10A and 10B provide further clarity on the VAEs. As shown in Figure 10A, a collection of images 102-106 that form a scene can be used to train the VAE. The recreated image 952 is compared to the input image at 954. This comparison may be performed by a human, or it could be performed digitally by a machine, for example comparing pixels. If the difference is less than a minimum threshold then the training is complete, and the system proceeds as per Figure 10B in which images 108-112 are input. These produce image vectors 960-963 which are compared to determine which of the images 108-112 constitute or comprise a scene.

When comparing images to see if they belong to the same scene, the system 100 first computes the IV for each Image, then it generates a pairwise distance for each image combination, as indicated at 810 in Figure 8, to then generate scene grouping 812.

If the system were to compare every image in a collection of images to every other image in that collection of images, then that would result in an algorithm with a computational complexity of O(n²). That, in turn, would result in a processing time that increases quadratically with the number of images. To avoid this problem at least one embodiment compares images in the collection that have a contextual relationship that can be determined from metadata relating to each image. This avoids the quadratic increase in computational complexity. In one embodiment the contextual relationship is temporal, so system 100 only compares images in a sequence of the capture time of the images. In another embodiment the relationship is serial, so for example each image is only compared to its direct successor. These processes result in a computational complexity of O(n).

In one embodiment the method compares the IV of any given image to the IVs of the next few images according to the image capture time, as recorded by the camera in the EXIF metadata of the image. An alternative embodiment compares each image to its direct successor and the successor's successor, up until some comparison factor (F). A comparison factor of F=2 is shown diagrammatically in Figure 11A, with each comparison represented as 1001. With a comparison factor of F=3, as shown by comparison groups 1101 in Figure 11B, the method can maintain a scene even if there is an outlier image in a sequence that would otherwise break the scene. For example, a photographer might take a series of photos of a bridge, then take one photo of the ground, and then continue to take photos of the bridge. F=3 would maintain the scene in this case. Larger values of F increase the computational complexity of the algorithm to O(F × n) while maintaining scenes for longer sequences.

The sequence of images 1202 in Figure 12 would require F=5 to maintain the scene for the two people. In some embodiments F can be dynamically adjusted based on computational capacity or computational load. For example, if at F=5 it is taking more than a threshold time period to process the input images, then F can be decremented by one to F=4, and the processing time can be monitored to determine whether it has improved. If there has not been sufficient improvement, then F can be decremented again. The opposite can occur i.e. if there is additional computational capacity then F can be incremented to improve the quality of the detection until a point is reached at which computational capacity has been reached.

A threshold Theta (θ) is used to control the "looseness" of the scenes. A large θ results in scenes with a large number of loosely related images, while a small θ results in small scenes of closely related images. In one embodiment, the value for θ is fixed to an average that has been found to work well for most types of images. In another embodiment, the value for θ is dynamically adjusted based on the type of photography. Portrait photography, for example, might require a smaller value for θ, as a minor variation in the photos can result in a new scene. In some embodiments dynamic adjustment can occur automatically as the algorithm learns the nature of the scene generally, or from feedback indicative of too many or too few different scenes in any given collection of images, for example.
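A minimal sketch of the neighbour comparison with factor F and threshold θ, assuming each image is represented by a NumPy Image Vector and using a generic Minkowski distance so that the measures discussed in the next paragraph can be substituted, is shown below; the threshold value is an illustrative assumption.

```python
# Sketch of the O(F x n) neighbour comparison: each Image Vector is compared
# only with its next F successors in capture order. p=2 gives L2 (Euclidean),
# p=1 gives L1 (Manhattan), and p=0.5 is a fractional distance measure.
import numpy as np

def minkowski(a, b, p=2.0):
    return float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))

def same_scene_pairs(image_vectors, F=3, theta=1.0, p=2.0):
    """Yield (i, j) index pairs whose vectors are close enough to share a scene."""
    n = len(image_vectors)
    for i in range(n):
        for j in range(i + 1, min(i + 1 + F, n)):
            if minkowski(image_vectors[i], image_vectors[j], p) < theta:
                yield i, j
```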

In order to compute the similarity between images, the method compares images' IVs using various methods or algorithms for measuring distances in multi-dimensional Euclidean coordinate space. One embodiment uses the L2 Euclidean distance. Another embodiment uses the L1 "Manhattan Distance", as indicated in Figure 8. Another embodiment can benefit from Fractional Distance Measures. These use a Fractional Distance Measure such as L0.5, which uses a fractional exponent in place of the integer exponents of L1 and L2 and which can result in scene groupings that are more resistant to visual noise in the images. A graphical relationship between accuracy ratio and noise masking probability is shown in Figure 13, which illustrates how fractional distance measures are more resilient to noise in an image.

Having grouped images into scenes, the details relating to the feature or features of interest in each image (which are referred to in this document as objective image annotations) comprising the scene are detected, and image desirability or lack thereof is determined. Further disclosure regarding an efficient CNN to detect these objective image annotations is provided later in this document. A process according to at least some embodiments for ranking, scoring or distilling desirable or undesirable images will now be described.

Referring to Figure 14, a method 1400 for rejecting undesirable images and selecting desirable images from a collection of digital images is shown.

The algorithm relies on inputs from methods referred to above: Scene Grouping, Face Finding, and Objective Image Annotations.

The method 1400 first computes Image Desirability Scores 1404-1408 for different objects and/or visual attributes of those objects for all images in a Scene Group. It then uses an algorithm 1410 to find outlier images that deviate from a median Image Desirability Score. The outlier images are selected to be labelled as undesirable 1420 or desirable 1430 depending on whether they are lower scoring or higher scoring respectively.
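A minimal sketch of labelling outliers around the median score of a scene, assuming a simple deviation rule of one standard deviation either side of the median, is shown below; the rule used in the disclosed embodiments may differ.

```python
# Minimal sketch of labelling outliers around the median Image Desirability
# Score of a scene. The one-standard-deviation rule is an assumption.
import statistics

def label_outliers(scores):
    median = statistics.median(scores)
    spread = statistics.pstdev(scores)
    labels = []
    for s in scores:
        if s < median - spread:
            labels.append("undesirable")      # low-scoring outlier (1420)
        elif s > median + spread:
            labels.append("desirable")        # high-scoring outlier (1430)
        else:
            labels.append("keep for review")
    return labels

print(label_outliers([0.82, 0.79, 0.20, 0.85, 0.97]))
```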

Figure 15 shows an algorithm 1500 according to some embodiments which is used to compute which images to reject as undesirable.

The exact weightings (W_expression, W_focus, W_eye) in the algorithm can vary based on the embodiment of the invention. One embodiment includes the weightings shown in Tables 1a, 1b and 1c.

Table 1a: Eye State

Table 1b: Expression State

Table 1c: Focus State

The process begins in Figure 15 with retrieving all the images in a scene at block 1501. These may be obtained from one or more of the processes described above, or may be provided directly in some embodiments.

For each image, at block 1502, and for each detected face in each image, at 1503, the following occurs. An object score for the relative size of each object (in this example each face) and/or the relative position of the face in the image is determined. Although Figure 15 shows both size and position being calculated for scoring purposes, only one may be selected in some embodiments. As can be seen from the Figure, the object score may be input at 1530 for a combined score. Similarly, one, two or all three of the confidences for attributes of each detected face may be determined in blocks 1506, 1507 and 1508. These confidence levels comprise scores, or are processed further as described below to provide scores that individually or collectively can be used to score the visual attributes of each face. The states for each of blocks 1506-1508 are provided in Tables 1a-1c. The training images used to train the CNN to detect the states are selected to reflect the required visual attributes. The attribute scores may be broken down into an expression score which is included in the combined score at 1532, and an eye score which is included in the combined score at 1531.

Weights are in some embodiments applied in blocks 1509, 1510, 1511 and 1512. In some embodiments penalties are added for poor focus scores (1514) and eye scores (1515), and the penalized focus and eye scores are combined (1517). Normalization of scores such as the eye scores (1516) and the expression scores (1513) can be performed to minimize the effect of negative scores. The scores are combined to produce a combined score at block 1518.

If there are more than two images in the scene (1521), then at block 1520 the scores are statistically processed to determine a threshold score so that a decision can be made as to which images to reject. In one embodiment the statistical processing includes calculating the median or mean of the combined image scores, and calculating the standard deviation, or applying a multiplier or proportion (the STD_Factor) of the standard deviation of the combined image scores. The difference between these calculations is used to generate a lower bound or threshold. The number of images below the lower bound is calculated in block 1523 and the images with scores below the lower bound are rejected or marked as rejected in block 1524.
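A minimal sketch of the combination and statistical rejection logic, assuming placeholder weightings (the values of Tables 1a-1c are not reproduced in this text) and a simplified weighted sum without the penalty and normalisation steps, is shown below.

```python
# Sketch of the combined-score and rejection logic of Figure 15, simplified to
# a weighted sum. Weights and the STD_Factor value are assumptions.
import statistics

W_EYE, W_EXPRESSION, W_FOCUS = 0.4, 0.3, 0.3        # placeholder weightings
STD_FACTOR = 0.7                                    # one of the factors discussed below

def combined_score(face):
    """Weighted combination of the per-attribute scores for one detected face."""
    return (W_EYE * face["eye"] +
            W_EXPRESSION * face["expression"] +
            W_FOCUS * face["focus"])

def reject_images(scene_scores, std_factor=STD_FACTOR):
    """Return indices of images whose combined score falls below the lower bound."""
    centre = statistics.median(scene_scores)        # or the mean, per the embodiment
    lower_bound = centre - std_factor * statistics.pstdev(scene_scores)
    return [i for i, s in enumerate(scene_scores) if s < lower_bound]
```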

If there are only two images in the scene then the absolute combined score difference between the images is calculated at block 1522. If the difference is above a threshold (1525) then the image with the lowest score is rejected (1526), otherwise both images are kept (1527).

The STD_Factor (from Figure 15) can be modified to adjust the number of images the method rejects. The smaller the factor, the more images will be rejected, potentially resulting in false positive rejections. An embodiment with a factor of 0.2 rejects about 25% of the most undesirable images in a collection. An embodiment with a factor of 0.7 rejects about 15% of the most undesirable images. An embodiment with a factor of 1.2 rejects about 10% of the most undesirable images.

The Machine Learning (ML) classifications and assessments rely on ML models. The models learn to predict bounding boxes, classifications, and assessments from labelled training data. They learn by adjusting the weights of neural network parameters using a backpropagation algorithm inside Convolutional Neural Networks (CNNs). Various CNN structures exist, and some are better than others at learning specific kinds of patterns.

All embodiments disclosed above share a common goal: they should maintain high levels of accuracy in a network with a relatively small number of parameters. Fewer parameters allow the network to compute assessments more quickly and thereby give the user of the software a better user experience. This section describes the ways in which the structures of the CNNs used in one or more embodiments are optimised to be both fast and small. Networks with small numbers of parameters tend to use less memory and compute classifications more quickly than larger networks. This is particularly important for running CNN models on a user's personal computing device, such as a mobile phone or laptop, where computational resources are constrained.

An embodiment involves a model that can find faces in an image with 700,000 parameters and a computational complexity of 3 GMAC/s, and a model that can detect the state of the eyes of a face with 125,000 parameters and a computational complexity of 0.02 GMAC/s.
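
Parameter counts of this kind can be checked directly on a model. A minimal PyTorch sketch, assuming the models are available as torch.nn.Module instances (the name face_detector is hypothetical), is:

import torch

def count_parameters(model: torch.nn.Module) -> int:
    """Total number of trainable parameters in a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# e.g. count_parameters(face_detector) would be on the order of 700,000 for the
# face-finding model of the embodiment above.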

These kinds of CNN models that can maintain high accuracy with fewer parameters are made possible by combining multiple methods.

Figure 16 shows a diagram of a CNN model, system or structure 1600 according to an embodiment in which computationally inexpensive convolution blocks 1602 are provided. Feature maps 1604 are produced by the mathematical convolution transformation that occurs in each block 1602. Blocks 1602 take advantage of redundant repeating patterns in feature maps that are typically produced by the mathematical convolution. Lightweight modules 1606 use simple filters to mirror and thus approximate or inexpensively replicate the repeating patterns from convolutions. The modules 1606 can replace some of the computationally expensive convolution operations, and thereby reduce the Neural Network's computational complexity while simultaneously increasing recognition accuracy. In some embodiments neural network architectures are employed in which all the regular convolutional layers are replaced with layers that use approximation modules 1606.
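
A minimal PyTorch sketch in the spirit of blocks 1602 and 1606 is given below: an ordinary convolution produces a few "primary" feature maps, and cheap depthwise filters approximate the remaining, largely redundant maps. The layer sizes, the ratio and the class name are assumptions for illustration; this is not the structure 1600 itself.

import torch
import torch.nn as nn

class CheapConvBlock(nn.Module):
    """Approximates part of the output channels with inexpensive depthwise filters."""

    def __init__(self, in_ch, out_ch, ratio=2):
        super().__init__()
        # Assumes out_ch is even and ratio=2 so that the cheap branch's group
        # convolution divides evenly.
        primary_ch = out_ch // ratio            # channels computed by a full convolution
        cheap_ch = out_ch - primary_ch          # channels approximated cheaply
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True))
        # Depthwise 3x3 filters inexpensively replicate repeating patterns in the primary maps.
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, cheap_ch, kernel_size=3, padding=1,
                      groups=primary_ch, bias=False),
            nn.BatchNorm2d(cheap_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        primary = self.primary(x)
        return torch.cat([primary, self.cheap(primary)], dim=1)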

The convolution block 1602 is downsampled to generate convolution block 1610, which is in turn downsampled to generate convolution block 1612. Block 1612 is convolved in a 1x1 convolution to provide block 1614, which is upsampled to generate block 1616. This is upsampled to generate block 1618. This structure provides two pathways, one downsampled and the other upsampled. The downsampled pathway allows features to be extracted, so it becomes semantically richer while spatial resolution decreases. The upsampled pathway constructs higher resolution layers from a semantically rich layer. Although the reconstructed layers are semantically strong, the locations of objects to be detected are imprecise, so convolutions (1x1 convolutions) are added between blocks 1610 and 1616 and between blocks 1602 and 1618 to provide lateral connections between the reconstructed layers and the corresponding feature maps. This improves object location prediction.
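
The down/up-sampling pathways and the 1x1 lateral connections can be sketched as a generic feature-pyramid arrangement in PyTorch; channel counts, the use of nearest-neighbour upsampling and the class name are assumptions for illustration.

import torch.nn as nn
import torch.nn.functional as F

class TwoLevelPyramid(nn.Module):
    """Downsampled pathway for semantics, upsampled pathway for resolution,
    with 1x1 lateral connections improving object localisation."""

    def __init__(self, ch):
        super().__init__()
        self.down1 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # 1602 -> 1610
        self.down2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)   # 1610 -> 1612
        self.reduce = nn.Conv2d(ch, ch, 1)                       # 1612 -> 1614 (1x1)
        self.lat_mid = nn.Conv2d(ch, ch, 1)                      # lateral 1610 -> 1616
        self.lat_top = nn.Conv2d(ch, ch, 1)                      # lateral 1602 -> 1618

    def forward(self, c1):                  # c1 corresponds to block 1602
        # Assumes even spatial dimensions so down- and up-sampled sizes match.
        c2 = self.down1(c1)                 # block 1610
        c3 = self.down2(c2)                 # block 1612
        p3 = self.reduce(c3)                # block 1614
        p2 = F.interpolate(p3, scale_factor=2) + self.lat_mid(c2)   # block 1616
        p1 = F.interpolate(p2, scale_factor=2) + self.lat_top(c1)   # block 1618
        return p1, p2, p3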

The structure provides scale-invariance for the model, allowing it to detect features in an image regardless of how large they appear. Intuitively, this property enables a model to detect objects across a large range of scales by scanning the model over both positions and sampling levels.

The model structure combines low-resolution, semantically strong features with high-resolution, semantically weak features. The result is a NN that has representational power, speed, and low memory usage.

Working with high-resolution representations of visual information is very helpful for identifying details in images. In some embodiments the CNN structure 1600 further maintains high-resolution representations by: (i) connecting the high-to-low resolution convolution streams in parallel; and (ii) repeatedly exchanging information across the higher and lower resolutions. Therefore, in some embodiments information is exchanged between blocks 1612 and 1616, and between blocks 1614 and 1610, for example. The benefit is that the resulting representation is semantically richer and spatially more precise.
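
One way such an exchange between parallel resolution streams could be expressed is sketched below: each stream receives a resampled, convolved copy of the other. The channel handling, kernel sizes and class name are simplifying assumptions.

import torch.nn as nn
import torch.nn.functional as F

class ExchangeUnit(nn.Module):
    """Fuses a high-resolution and a low-resolution stream in both directions."""

    def __init__(self, ch):
        super().__init__()
        self.to_low = nn.Conv2d(ch, ch, 3, stride=2, padding=1)  # high -> low resolution
        self.to_high = nn.Conv2d(ch, ch, 1)                      # low -> high: 1x1 then upsample

    def forward(self, high, low):
        # Assumes even spatial dimensions so the strided copy matches the low-resolution stream.
        new_high = high + F.interpolate(self.to_high(low), size=high.shape[-2:])
        new_low = low + self.to_low(high)
        return new_high, new_low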

When preparing training data to train the CNN to detect faces, in addition to annotating the bounding box of each face in an image, further facial landmarks are provided in some embodiments, as represented by training input image 1650. A better representation of training images is shown in Figure 17. In some embodiments five facial landmarks (both eyes, top of the nose, and corners of the mouth) are provided. This results in a significant improvement in recognition accuracy. Particularly in hard face detection problems, e.g. when the face is at an odd angle or blurry, the landmarks are an additional supervision signal that helps the CNN learn.
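
Training with the facial landmarks as an extra supervision signal can be expressed as a multi-task loss. The loss functions and the landmark weighting below are assumptions for illustration, not the training objective actually used.

import torch
import torch.nn.functional as F

def detection_loss(pred_boxes, gt_boxes, pred_landmarks, gt_landmarks,
                   pred_cls, gt_cls, landmark_weight=0.5):
    """Face-detection loss with an auxiliary five-landmark regression term.

    pred_landmarks / gt_landmarks have shape (N, 5, 2): both eyes, the nose tip
    and the two mouth corners, normalised to the bounding box.
    """
    cls_loss = F.binary_cross_entropy_with_logits(pred_cls, gt_cls)
    box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)
    lmk_loss = F.smooth_l1_loss(pred_landmarks, gt_landmarks)
    # The landmark term is auxiliary supervision: it sharpens the features the
    # detector learns, which helps on blurred or oddly angled faces.
    return cls_loss + box_loss + landmark_weight * lmk_loss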

In a conventional CNN the fully connected or dense layers contain a large number of parameters. This introduces significant computational complexity to the network. Therefore, in some embodiments block detectors 1620-1624 are provided. These are configured as a sub-structure that removes the conventional fully connected layer and replaces it with a custom structure that provides greater recognition accuracy with significantly less computational complexity. The block detectors 1620-1624 are connected to each of the layers 1618-1614 to generate output predictions 1630-1634.
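
A minimal sketch in the spirit of the block detectors 1620-1624 is a small convolutional head attached to one pyramid level in place of a dense layer; the channel counts, anchor count and output size are illustrative assumptions.

import torch.nn as nn

class BlockDetector(nn.Module):
    """Convolutional prediction head attached to one feature-pyramid level."""

    def __init__(self, in_ch, num_anchors=2, num_outputs=5):
        super().__init__()
        # 3x3 and 1x1 convolutions in place of a fully connected layer keep the
        # parameter count low while predicting at every spatial location.
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, num_anchors * num_outputs, 1))

    def forward(self, feature_map):
        # Output: one (score, box) prediction set per anchor per location (cf. 1630-1634).
        return self.head(feature_map)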

As used herein, the term "image" refers to any digital item capable of producing a visual representation. For instance, the term "image" includes digital images and frames of digital video. As used herein, the term "digital image" refers to any digital symbol, picture, icon, or illustration. For example, the term "digital image" includes digital files with the following file extensions: JPG, TIFF, BMP, PNG, RAW, or PDF.

The term "machine learning," as used herein, refers to the process of constructing and implementing algorithms that can learn from and make predictions on data. In general, machine learning may operate by building models from example inputs (e.g., training), such as a training font set, to make data-driven predictions or decisions. In some example embodiments, machine learning is used for data mining, and statistical pattern recognition, such as collaborative feature learning, or learning features from a training image set, which can be supervised or unsupervised.

As used herein, the term "neural network" refers to a machine learning model that can be tuned (e.g., trained) based on inputs to approximate unknown functions. In particular, the term neural network can include a model of interconnected neurons that communicate and learn to approximate complex functions and generate outputs based on a plurality of inputs provided to the model. For instance, the term neural network includes one or more machine learning algorithms. In particular, the term neural network can include deep convolutional neural networks (i.e., "CNNs"). In addition, a neural network is an algorithm (or set of algorithms) that implements deep learning techniques that utilize a set of algorithms to model high-level abstractions in data.

Each of the components of system 100 and their corresponding elements may be in communication with one another using any suitable communication technologies. It will be recognized that although the plurality of components and their corresponding elements are shown to be separate in a number of the drawing Figures, any of the plurality of components and their corresponding elements may be combined into fewer components, such as into a single facility or module, divided into more components, or configured into different components as may serve a particular embodiment.

The plurality of components and their corresponding elements can comprise software, hardware, or both. For example, the plurality of components and their corresponding elements can comprise one or more instructions stored on a computer-readable storage medium and executable by processors of one or more computing devices. The plurality of components and their corresponding elements can comprise hardware, such as a special purpose processing device to perform a certain function or group of functions. Additionally, or alternatively, the plurality of components and their corresponding elements can comprise a combination of computer-executable instructions and hardware.

Furthermore, the plurality of components of the system 100 may, for example, be implemented as one or more stand-alone applications, as one or more modules of an application, as one or more plug-ins, as one or more library functions or functions that may be called by other applications, and/or as a cloud-computing model. Thus, the plurality of components of the system 100 may be implemented as a stand-alone application, such as a desktop or mobile application. Furthermore, the plurality of components of the system 100 may be implemented as one or more web-based applications hosted on a remote server. Alternatively, or additionally, the components of the system 100 may be implemented in a suite of mobile device applications or "apps."

The plurality of components described above for operating the system 100 may be implemented in a system environment. Specifically, Figure 18 illustrates a schematic diagram of one embodiment of an exemplary system environment ("environment") 2100 in which a system 100 can operate. As illustrated in Figure 18, the environment 2100 can include a server(s) 2110, a network 2116, and a plurality of client devices 2110a-2110n. The server(s) 2110, the network 2116, the plurality of client devices 2110a-2110n, and the other components of the environment 2100 may be communicatively coupled with each other either directly or indirectly (e.g., through the network 2116).

As illustrated in Figure 18, the environment 2100 can include the server(s) 2110. The server(s) 2110 may generate, store, receive, and/or transmit any type of data. The server(s) 2110 can comprise a communication server or a web-hosting server. In one or more embodiments, the server(s) 2110 may comprise a data server.

Moreover, as illustrated in Figure 18, the environment 2100 can include the plurality of client devices 2110a-2110n. The plurality of client devices 2110a-2110n may comprise a variety of different computing devices, such as personal computers, laptop computers, mobile devices, smartphones, tablets, special purpose computers, TVs, or other computing devices. As illustrated in Figure 18, the plurality of client devices 2110a-2110n and/or the server(s) 2110 may communicate via the network 2116.

As illustrated, the system 100 or any part or component thereof can be implemented by a variety of components in the environment 2100. For example, the server(s) 2110 can host the neural networks utilized herein, while the plurality of client devices 2110a-2110n can implement the ranking or scoring process, or simply be used to provide an interface for a user. When implemented in part on the server(s) 2110, and in part on the plurality of client devices 2110a-2110n, the system 100 components are communicatively coupled (i.e., via the network 2116).

Although Figure 18 illustrates a single server(s) 2110, it will be appreciated that the server(s) 2110 can represent any number of server computing devices. Similarly, although Figure 18 illustrates a particular arrangement of the server(s) 2110, network 2116, and the plurality of client devices 2110a-2110n, various additional arrangements are possible.

The disclosure above provides a number of different systems and devices. In addition to the foregoing, embodiments can also be described in terms of a series of acts for accomplishing a particular result. For example, while various flow charts have been shown and described according to various embodiments, alternative embodiments may omit, add to, reorder, and/or modify any of the acts shown or described. The acts of the Figures can be performed as part of a method. Alternatively, a non-transitory computer readable medium can comprise instructions, that when executed by one or more processors, cause a computing device to perform the acts of the Figures and description. In still further embodiments, a system can perform the acts of the Figures and description.

Embodiments of the present disclosure may comprise or utilize a special purpose or general-purpose computer including computer hardware, such as, for example, one or more processors and system memory, as discussed in greater detail below. Embodiments within the scope of the present disclosure also include physical and other computer-readable media for carrying or storing computer-executable instructions and/or data structures. In particular, one or more of the processes described herein may be implemented at least in part as instructions embodied in a non-transitory computer-readable medium and executable by one or more computing devices (e.g., any of the media content access devices described herein). In general, a processor (e.g., a microprocessor) receives instructions, from a non-transitory computer-readable medium, (e.g., a memory, etc.), and executes those instructions, thereby performing one or more processes, including one or more of the processes described herein.

Computer-readable media can be any available media that can be accessed by a general purpose or special purpose computer system. Computer-readable media that store computer-executable instructions are non-transitory computer-readable storage media (devices). Computer-readable media that carry computer-executable instructions are transmission media. Thus, by way of example, and not limitation, embodiments of the disclosure can comprise at least two distinctly different kinds of computer-readable media: non-transitory computer-readable storage media (devices) and transmission media.

Non-transitory computer-readable storage media (devices) includes RAM, ROM, EEPROM, CD-ROM, solid state drives ("SSDs") (e.g., based on RAM), Flash memory, phase-change memory ("PCM"), other types of memory, other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer.

A "network" is defined as one or more data links that enable the transport of electronic data between computer systems and/or modules and/or other electronic devices. When information is transferred or provided over a network or another communications connection (either hardwired, wireless, or a combination of hardwired or wireless) to a computer, the computer properly views the connection as a transmission medium. Transmissions media can include a network and/or data links which can be used to carry desired program code means in the form of computer-executable instructions or data structures and which can be accessed by a general purpose or special purpose computer. Combinations of the above should also be included within the scope of computer-readable media. Further, upon reaching various computer system components, program code means in the form of computer-executable instructions or data structures can be transferred automatically from transmission media to non- transitory computer-readable storage media (devices) (or vice versa). For example, computer-executable instructions or data structures received over a network or data link can be buffered in RAM within a network interface module (e.g., a "NIC"), and then eventually transferred to computer system RAM and/or to less volatile computer storage media (devices) at a computer system. Thus, it should be understood that non-transitory computer-readable storage media (devices) can be included in computer system components that also (or even primarily) utilize transmission media.

Computer-executable instructions comprise, for example, instructions and data which, when executed at a processor, cause a general-purpose computer, special purpose computer, or special purpose processing device to perform a certain function or group of functions. In some embodiments, computer-executable instructions are executed on a general-purpose computer to turn the general-purpose computer into a special purpose computer implementing elements of the disclosure. The computer-executable instructions may be, for example, binaries, intermediate format instructions such as assembly language, or even source code. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the described features and acts are disclosed as example forms of implementing the claims.

Those skilled in the art will appreciate that the disclosure may be practiced in network computing environments with many types of computer system configurations, including, personal computers, desktop computers, laptop computers, message processors, hand-held devices, multi-processor systems, microprocessor-based or programmable consumer electronics, network PCs, minicomputers, mainframe computers, mobile telephones, PDAs, tablets, pagers, routers, switches, and the like. The disclosure may also be practiced in distributed system environments where local and remote computer systems, which are linked (either by hardwired data links, wireless data links, or by a combination of hardwired and wireless data links) through a network, both perform tasks. In a distributed system environment, program modules may be located in both local and remote memory storage devices.

Embodiments of the present disclosure can also be implemented in cloud computing environments. In this description, "cloud computing" is defined as a model for enabling on-demand network access to a shared pool of configurable computing resources. For example, cloud computing can be employed in the marketplace to offer ubiquitous and convenient on-demand access to the shared pool of configurable computing resources. The shared pool of configurable computing resources can be rapidly provisioned via virtualization and released with low management effort or service provider interaction, and then scaled accordingly.

A cloud-computing model can be composed of various characteristics such as, for example, on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service, and so forth. A cloud-computing model can also expose various service models, such as, for example, Software as a Service ("SaaS"), Platform as a Service ("PaaS"), and Infrastructure as a Service ("IaaS"). A cloud-computing model can also be deployed using different deployment models such as private cloud, community cloud, public cloud, hybrid cloud, and so forth. In this description and in the claims, a "cloud-computing environment" is an environment in which cloud computing is employed.

Figure 19 illustrates a block diagram of an exemplary computing device 2300 that may be configured to perform one or more of the processes described above. One will appreciate that one or more computing devices such as the computing device 2300 may implement one or more components of the system 100. As shown by Figure 19, the computing device 2300 can comprise a processor 2302, a memory 2304, a storage device 2306, an I/O interface 2308, and a communication interface 2310, which may be communicatively coupled by way of a communication infrastructure 2312. While an exemplary computing device 2300 is shown in Figure 19, the components illustrated in Figure 19 are not intended to be limiting. Additional or alternative components may be used in other embodiments. Furthermore, in certain embodiments, the computing device 2300 can include fewer components than those shown in Figure 19. Components of the computing device 2300 shown in Figure 19 will now be described in additional detail.

In one or more embodiments, the processor 2302 includes hardware for executing instructions, such as those making up a computer program. As an example and not by way of limitation, to execute instructions, the processor 2302 may retrieve (or fetch) the instructions from an internal register, an internal cache, the memory 2304, or the storage device 2306 and decode and execute them. In one or more embodiments, the processor 2302 may include one or more internal caches for data, instructions, or addresses. As an example and not by way of limitation, the processor 2302 may include one or more instruction caches, one or more data caches, and one or more translation lookaside buffers (TLBs). Instructions in the instruction caches may be copies of instructions in the memory 2304 or the storage device 2306.

The memory 2304 may be used for storing data, metadata, and programs for execution by the processor(s). The memory 2304 may include one or more of volatile and non-volatile memories, such as Random Access Memory ("RAM"), Read Only Memory ("ROM"), a solid-state disk ("SSD"), Flash, Phase Change Memory ("PCM"), or other types of data storage. The memory 2304 may be internal or distributed memory.

The storage device 2306 includes storage for storing data or instructions. As an example and not by way of limitation, storage device 2306 can comprise a non-transitory storage medium described above. The storage device 2306 may include a hard disk drive (HDD), a floppy disk drive, flash memory, an optical disc, a magneto-optical disc, magnetic tape, or a Universal Serial Bus (USB) drive or a combination of two or more of these. The storage device 2306 may include removable or non-removable (or fixed) media, where appropriate. The storage device 2306 may be internal or external to the computing device 2300. In one or more embodiments, the storage device 2306 is non-volatile, solid-state memory. In other embodiments, the storage device 2306 includes read-only memory (ROM). Where appropriate, this ROM may be mask programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory or a combination of two or more of these.

The I/O interface 2308 allows a user to provide input to, receive output from, and otherwise transfer data to and receive data from computing device 2300. The I/O interface 2308 may include a mouse, a keypad or a keyboard, a touch screen, a camera, an optical scanner, network interface, modem, other known I/O devices or a combination of such I/O interfaces. The I/O interface 2308 may include one or more devices for presenting output to a user, including, but not limited to, a graphics engine, a display (e.g., a display screen), one or more output drivers (e.g., display drivers), one or more audio speakers, and one or more audio drivers. In certain embodiments, the I/O interface 2308 is configured to provide graphical data to a display for presentation to a user. The graphical data may be representative of one or more graphical user interfaces and/or any other graphical content as may serve a particular implementation.

The communication interface 2310 can include hardware, software, or both. In any event, the communication interface 2310 can provide one or more interfaces for communication (such as, for example, packet-based communication) between the computing device 2300 and one or more other computing devices or networks. As an example and not by way of limitation, the communication interface 2310 may include a network interface controller (NIC) or network adapter for communicating with an Ethernet or other wire-based network, or a wireless NIC (WNIC) or wireless adapter for communicating with a wireless network, such as a WI-FI network.

Additionally or alternatively, the communication interface 2310 may facilitate communications with an ad hoc network, a personal area network (PAN), a local area network (LAN), a wide area network (WAN), a metropolitan area network (MAN), or one or more portions of the Internet or a combination of two or more of these. One or more portions of one or more of these networks may be wired or wireless. As an example, the communication interface 2310 may facilitate communications with a wireless PAN (WPAN) (such as, for example, a BLUETOOTH WPAN), a WI-FI network, a WI-MAX network, a cellular telephone network (such as, for example, a Global System for Mobile Communications (GSM) network), or other suitable wireless network or a combination thereof.

Additionally, the communication interface 2310 may facilitate communications across various communication protocols. Examples of communication protocols that may be used include, but are not limited to, data transmission media, communications devices, Transmission Control Protocol ("TCP"), Internet Protocol ("IP"), File Transfer Protocol ("FTP"), Telnet, Hypertext Transfer Protocol ("HTTP"), Hypertext Transfer Protocol Secure ("HTTPS"), Session Initiation Protocol ("SIP"), Simple Object Access Protocol ("SOAP"), Extensible Mark-up Language ("XML") and variations thereof, Simple Mail Transfer Protocol ("SMTP"), Real-Time Transport Protocol ("RTP"), User Datagram Protocol ("UDP"), Global System for Mobile Communications ("GSM") technologies, Code Division Multiple Access ("CDMA") technologies, Time Division Multiple Access ("TDMA") technologies, Short Message Service ("SMS"), Multimedia Message Service ("MMS"), radio frequency ("RF") signaling technologies, Long Term Evolution ("LTE") technologies, wireless communication technologies, in-band and out-of-band signaling technologies, and other suitable communications networks and technologies.

The communication infrastructure 2312 may include hardware, software, or both that couples components of the computing device 2300 to each other. As an example and not by way of limitation, the communication infrastructure 2312 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a front side bus (FSB), a HYPERTRANSPORT (HT) interconnect, an Industry Standard Architecture (ISA) bus, an INFINIBAND interconnect, a low-pin-count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a serial advanced technology attachment (SATA) bus, a Video Electronics Standards Association local (VLB) bus, or another suitable bus or a combination thereof.

The foregoing specification is described with reference to specific exemplary embodiments thereof. Various embodiments and aspects of the disclosure are described with reference to details discussed herein, and the accompanying drawings illustrate the various embodiments. The description above and drawings are illustrative and are not to be construed as limiting. Numerous specific details are described to provide a thorough understanding of various embodiments.

The additional or alternative embodiments may be embodied in other specific forms without departing from their spirit or essential characteristics. The described embodiments are to be considered in all respects only as illustrative and not restrictive. The scope of the invention is, therefore, indicated by the appended claims rather than by the foregoing description. All changes that come within the meaning and range of equivalency of the claims are to be embraced within their scope.