Title:
IMAGE AND VIDEO CLASSIFICATION AND FILTERING METHOD
Document Type and Number:
WIPO Patent Application WO/2008/103441
Kind Code:
A1
Abstract:
A method of classifying images or video containing nude content. A scene is identified from the video. A representative video frame is selected from the identified scene. Video frames most likely containing nude images are identified in the pool of representative video frames. The representative video frames are sorted into a list, with the nude video frames in the front of the list. The list is presented to a user for handling.

Inventors:
PHAM HIEP V (US)
CHEN GEORGE Q (US)
Application Number:
PCT/US2008/002346
Publication Date:
August 28, 2008
Filing Date:
February 22, 2008
Assignee:
SOARSPACE INC (US)
PHAM HIEP V (US)
CHEN GEORGE Q (US)
International Classes:
G06V10/28; G11B27/00; G06V10/50; G06V10/56
Foreign References:
US6904168B12005-06-07
US20050114907A12005-05-26
US20040208361A12004-10-21
US20050002452A12005-01-06
US20060004716A12006-01-05
Attorney, Agent or Firm:
SZE, James, Y. C. et al. (101 West Broadway Suite 90, San Diego CA, US)
Claims:

It is claimed:

1. A method of classifying video, the video containing at least one scene with a plurality of video frames, the method comprising: identifying a scene from the video; selecting a representative video frame from the identified scene; identifying nude video frames in the representative video frames most likely containing nude images; sorting the representative video frames into a list, the list containing nude video frames in the front of the list; presenting the list to a user.

2. The method of claim 1, wherein the identifying of the scene comprises: decoding a first video frame; calculating a color profile of the first video frame; decoding a second video frame; calculating a color profile of the second video frame; calculating a color profile difference between the first and the second video frames; identifying the end of a scene when the color profile difference is greater than three times an average color profile of video frames within a scene.

3. The method of claim 2, wherein the representative video frame is the median of a joint likeness measure of video frames within the scene.

4. The method of claim 3, wherein the calculating a color profile of the first video frame includes quantizing the first video frame in Saturation Value (SV) space.

5. The method of claim 4, wherein the calculating a color profile of the second video frame includes quantizing the second video frame in Saturation Value space.

6. The method of claim 5, wherein identifying nude video frames comprises: comparing a quantized color histogram of a suspected video frame with a color histogram of a database of known nudes; marking a suspected video frame as the nude video frame when the quantized color histogram of the suspected video frame is within a predetermined range from the color histogram of the database of known nudes.

7. The method of claim 1, wherein the representative video frame is the median of a joint likeness measure of video frames within the scene.

8. The method of claim 1, wherein identifying nude video frames comprises: comparing a quantized color histogram of a suspected video frame with a color histogram of a database of known nudes; marking a suspected video frame as the nude video frame when the quantized color histogram of the suspected video frame is within a predetermined range from the color histogram of the database of known nudes.

9. A computer-readable medium encoded with data and instructions to classify video, the video containing at least one scene with a plurality of video frames, that when executed by a computing device causes the computing device to: identify a scene from the video; select a representative video frame from the identified scene; identify nude video frames in the representative video frames most likely containing nude images; sort the representative video frames into a list, the list containing nude video frames in the front of the list; present the list to a user.

10. The computer-readable medium of claim 9, wherein the identifying of the scene comprises: decoding a first video frame; calculating a color profile of the first video frame; decoding a second video frame; calculating a color profile of the second video frame; calculating a color profile difference between the first and the second video frames; identifying the end of a scene when the color profile difference is greater than three times an average color profile of video frames within a scene.

11. The computer-readable medium of claim 10, wherein the representative video frame is the median of a joint likeness measure of video frames within the scene.

12. The computer-readable medium of claim 11, wherein the calculating a color profile of the first video frame includes quantizing the first video frame in Saturation Value (SV) space.

13. The computer-readable medium of claim 12, wherein the calculating a color profile of the second video frame includes quantizing the second video frame in Saturation Value space.

14. The computer-readable medium of claim 13, wherein identifying nude video frames comprises: comparing a quantized color histogram of a suspected video frame with a color histogram of a database of known nudes; marking a suspected video frame as the nude video frame when the quantized color histogram of the suspected video frame is within a predetermined range from the color histogram of the database of known nudes.

15. The computer-readable medium of claim 9, wherein the representative video frame is the median of a joint likeness measure of video frames within the scene.

16. The computer-readable medium of claim 9, wherein identifying nude video frames comprises: comparing a quantized color histogram of a suspected video frame with a color histogram of a database of known nudes; marking a suspected video frame as the nude video frame when the quantized color histogram of the suspected video frame is within a predetermined range from the color histogram of the database of known nudes.

17. The method of claim 8, wherein presenting the list to the user occurs on a video display.

18. The computer-readable medium of claim 16, wherein presenting the list to the user occurs on a video display.

Description:

U.S. PATENT APPLICATION - IMAGE AND VIDEO CLASSIFICATION AND FILTERING METHOD

BACKGROUND

Field of the Invention

Aspects of the present invention relate in general to a computerized image classification and filtering apparatus. Further aspects of the invention include an apparatus, method or computer-readable medium configured to classify undesired images and filter them from being displayed on a computer terminal or display.

Description of the Related Art

In March 1989, the European Laboratory for Particle Physics, or CERN (Conseil Européen pour la Recherche Nucléaire), developed the World-Wide-Web (WWW, or simply, "the web"), an Internet-based computer network that allows users on one computer to access information stored on other computers through a world-wide network. With an intuitive user-interface, known as a web browser, the web rapidly became a popular way of transmitting and accessing text and graphical information. Since then, there has been a massive expansion in the number of World-Wide-Web sites, and the amount of information placed on the web.

To place information on the web, the information must be stored in a binary or text format in a "file." Binary documents are saved in known formats that depend upon the information being stored. For example, two-dimensional pictures are often stored in "Joint Photographic Experts Group" (JPEG) or "Graphics Interchange Format" (GIF) standard formats. Audio files and moving images have other formats as well, such as "wav," "mov," "avi," "mp3," and "mpeg." Text documents are stored in HyperText Markup Language (HTML) format. The HTML format dictates the appearance and structure of a web text document, also referred to as a "web page." Graphical images often form the basis of internet and web content, and are the key to creating a useful and meaningful web-site.

With the explosion of World-Wide-Web content, a great deal of the image content on the Internet is undesirable in a work setting. To be blunt, pornographic or nude picture content may be offensive to many casual or family users of the World-Wide-Web. Furthermore, companies may wish to filter or remove such content to avoid creating a hostile work environment. Most Internet filtering is text-based, but this does not resolve the problem of problematic image data.

SUMMARY

Embodiments include a method of classifying video or still images. A scene is identified from the video. A representative video frame is selected from the identified scene. Video frames most likely containing nude images are identified in the pool of representative video frames. The representative video frames are sorted into a list, with the nude video frames in the front of the list. The list is presented to a user.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-B illustrate a Hue Saturation Value (HSV) color space visualized as a cone or a cylinder.

FIG. 2 depicts an embodiment dividing the gray level (saturation value) of an image into seven levels.

FIG. 3 depicts a hue space quantization embodiment.

FIG. 4 illustrates an individual photo color adaptation histogram embodiment.

FIG. 5 shows an embodiment histogram in which the center location, variance, and Gaussian model are adapted.

FIG. 6 illustrates an embodiment showing adaptive skin segmentation.

FIG. 7 shows a 4x4 image grid embodiment.

FIG. 8 illustrates an 8-orientation edge histogram embodiment.

FIGS. 9A-B illustrate the accumulation of grid edge histograms into vertical and horizontal histograms in accordance with an embodiment.

FIG. 10 shows the computation of skin color frequency in four central grids.

FIG. 11 depicts an embodiment combining multiple sorting lists.

FIG. 12 is a flowchart of a cross-filtering process embodiment.

FIG. 13 depicts an embodiment selecting key frames in a video.

FIG. 14 illustrates an embodiment selecting a representative frame R.

FIG. 15 shows an embodiment sorting representative frames.

FIG. 16 illustrates sorting videos based on representative frames.

FIG. 17 shows a system embodiment.

FIG. 18 illustrates a learning embodiment.

FIG. 19 illustrates architecture of a system embodiment.

FIG. 20 shows a user-interface embodiment.

FIG. 21 depicts the use of a game controller as an entry device.

DETAILED DESCRIPTION

One aspect of the present invention includes the recognition that undesired images may be classified and filtered via a computerized method and apparatus.

The features of the following embodiments may be combined or used separately.

Non-Uniform HSV Color-space Vector Quantization

Hue Saturation Value is a system for describing the physical perception of color in terms of tint (also known as "hue" or "color tone"), perceived narrowness of the spectrum (the "saturation" or "chroma"), and luminance (also known as "brightness" or "value").

As shown in FIGS. 1A and 1B, HSV color space may be visualized either as a cone (FIG. 1A) or a cylinder (FIG. 1B).

Hue determines the position on the color wheel or color circle, Saturation is the purity of the color, and Luminance is the range of lightness to darkness of the color.

Initially, the saturation-value (SV) space of an image is quantized. When the intensity of an image is less than 32 on a 256-level scale (8-bit color), the image is perceived as black. Thus all colors whose intensity is less than 32/256 = 1/8 = 0.125 are assigned to index 0. Similarly, when the saturation is less than 0.125 (over the maximal scale of 1), the color is perceived as gray. It is also known in the art that two adjacent gray blocks are hard to distinguish if their gray levels differ by no more than 32. Thus the gray level is divided into 7 levels, as illustrated in FIG. 2.
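As a rough Python sketch of the SV quantization just described (a minimal sketch: only the 0.125 cutoffs and the seven-level gray split come from the text above; the exact index layout is an assumption made for illustration):

def quantize_sv(saturation, value):
    """Assign a saturation-value (SV) index following the scheme described above.

    Values below 0.125 are perceived as black (index 0); saturations below 0.125
    are perceived as gray and split into 7 gray levels.  The index layout beyond
    that is an assumption made for illustration.
    """
    if value < 0.125:                 # intensity < 32/256: perceived as black
        return 0
    if saturation < 0.125:            # low saturation: perceived as gray
        level = int((value - 0.125) / (1.0 - 0.125) * 7)
        return 1 + min(level, 6)      # seven gray levels -> indices 1..7
    # Chromatic color: its V level (1..7) selects the hue-quantization map
    # described in the next sub-section.
    v_level = 1 + min(int((value - 0.125) / (1.0 - 0.125) * 7), 6)
    return 7 + v_level                # placeholder chromatic index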

Hue space ("H space") quantization is shown in FIG. 3. At each quantized value level ("V level") v (1, ..., 7), the H space is first quantized into a set of concentric circular bands. As seen in the SV space quantization map, the number of bands at level v equals v. Then each band is divided into cells in one of two ways: equal division or unequal division.

With equal division, a band at level v is equally divided into 2v cells. The number of cells is v² + 3v + 1. Thus, FIG. 3 depicts the quantization at level 3 where there are 19 cells.

With unequal division, the H space is further separated into multiple fan-shaped regions, each region divided into cells similarly to the previous equal division. The specification of the regions and the number of cells in each region are application specific. For our application, skin colors are assigned more cells than non-skin colors. Blue and green colors are given a very small number of cells.
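As a quick arithmetic check of the equal-division cell count (this merely restates the v² + 3v + 1 formula quoted above and is not part of the patented method):

def equal_division_cell_count(v_level):
    # Number of hue cells at value level v for equal division: v^2 + 3v + 1.
    return v_level ** 2 + 3 * v_level + 1

# At level 3 this gives 19 cells, matching the example of FIG. 3.
assert equal_division_cell_count(3) == 19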

Adaptive Skin Segmentation for Image Nudity Detection

The first sub-process in our system of nude photo detection is to find human figures with bare skin and remove the background, which we call skin segmentation. Our skin segmentation algorithm is based on two observations. First, there is significant overlap among the human skin colors of different races. Thus a universal color model is possible which can at least partially segment the human figures from the rest of the image. Second, each photo has its own unique skin color reflection due to the individual person, lighting, hardware devices (camera, scanner), etc. Thus there is a need to adjust the universal color model according to the unique color distribution of each photo to correctly identify the human with bare skin exposed.

In one aspect of the present invention, a method embodiment performs the following:

1. RGB color space to HSV color space conversion. HSV color space is represented as a cone.

2. Global color model - 5,000 images with significant nudity are downloaded from the internet, and each pixel is converted to HSV space. At this sub-process, the HSV space is uniformly quantized. A quantized color histogram is computed and then approximated by a Gaussian distribution.

3. Non-uniform HSV color space vector quantization. After obtaining the skin color histogram, the HSV space is re-quantized non-uniformly, assigning more bins to those colors corresponding to human skin. After non-uniform quantization, the global color model is recomputed.

When a new photo (also known as a "still image") comes in, its color histogram is first computed in the previous non-uniform HSV space. At this sub-process, the exact skin color distribution for this particular photo is unknown. What is known is that this distribution is near the global distribution. The following sub-process adjusts the universal model to fit the local distribution. Individual photo color adaptation is illustrated in FIG. 4. Starting from the universal distribution, we iteratively adjust its center location and variance until a best fit is found to the local histogram.

1) The new center location is calculated by center = Σ_x b_x · G(b_x) / Σ_x G(b_x), where x goes through all the pixels of the image, b_x is the bin number of pixel x, and G(b) is the universal Gaussian distribution.

2) The new variance is calculated by variance = Σ_x (b_x − center)² · G(b_x) / Σ_x G(b_x).

3) Update the universal Gaussian model G with the new center and variance.

4) Repeat 1), 2) and 3) until the center is unchanged. At convergence, G becomes the adapted skin color distribution for the specific photo under consideration.

Sub-processes 1), 2), 3) and 4) are illustrated in FIG. 5.

6. Segmentation

Now that we have located the skin color distribution specific to the photo, we 'white out' all pixels that do not belong to the distribution. Additionally, a bounding box is found and all pixels outside the bounding box are discarded.
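A minimal NumPy sketch of the adaptation loop and the whitening step described above follows; the bin representation, the convergence tolerance, and the k-standard-deviation cutoff for "belonging to the distribution" are assumptions made for illustration, and the bounding-box step is omitted.

import numpy as np

def adapt_skin_model(bin_indices, center, variance, tol=1e-3, max_iter=50):
    """Adapt the universal Gaussian skin model to one photo.

    bin_indices: array of non-uniform HSV bin numbers, one per pixel.
    center, variance: parameters of the universal Gaussian over bin numbers.
    Returns the adapted (center, variance).
    """
    b = np.asarray(bin_indices, dtype=float).ravel()
    for _ in range(max_iter):
        g = np.exp(-0.5 * (b - center) ** 2 / variance)                          # G(b_x)
        new_center = np.sum(b * g) / (np.sum(g) + 1e-12)                         # step 1)
        new_variance = np.sum((b - new_center) ** 2 * g) / (np.sum(g) + 1e-12)   # step 2)
        if abs(new_center - center) < tol:                                       # center unchanged: stop
            return new_center, new_variance
        center, variance = new_center, new_variance                              # step 3)
    return center, variance

def whiten_non_skin(image, bin_indices, center, variance, k=2.0):
    """'White out' pixels whose bin is far from the adapted distribution.

    k is an assumed cutoff in standard deviations; the text only says pixels
    not belonging to the distribution are whited out.
    """
    mask = np.abs(bin_indices - center) <= k * np.sqrt(variance)
    out = image.copy()
    out[~mask] = 255                  # white
    return out, mask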

Photo Classification Combining Nude-filtering and Non-nude-filtering

In one embodiment, there are three sub-processes in a photo classification algorithm - feature, template and cross-filtering. Each sub-process is described below. Note that it is assumed all images have been pre-processed by the methods described above and re-sampled to one resolution, 256x256.

1. Features - a feature is composed of an array of coefficients computed from an image that describes the image in one aspect. (An illustrative sketch of the edge feature computation appears after this feature list.)

1) Color feature - 128-bin color histogram in HSV space

2) Edge feature - 128 (16x8) coefficients

• Divide the image into a 4x4 equal-size grid, as shown in FIG. 7.

• Perform edge detection on each grid

• Compute an 8-orientation edge histogram where the contribution of an edge is proportional to its intensity, uniformly quantized into 16 levels as shown in FIG. 8.

3) Edge profile - 64 (8x8) coefficients

• Accumulate the grid edge histograms into vertical and horizontal strip histograms as illustrated in FIGS. 9A and 9B.

4) Color profile - 256 (32x8) coefficients

• Compute 32-bin strip color histograms, where the strips are defined as in 1.3. The differences between this feature and the color feature of 1.1 are:

o Color profile uses a 32-bin histogram whereas the color feature uses a 128-bin histogram

o Color profile comprises 8 strip histograms whereas the color feature has only one histogram for the whole image

5) Color frequency - 5 coefficients

• Strip skin color frequency - take the 2 center vertical strips in the color profile and sum up the bins for each strip. Do the same thing for the 2 center horizontal strips. This produces the frequency of skin color pixels in each strip. They are denoted as h₁, h₂, v₁ and v₂.

• Compute the skin color frequency in the 4 central grids as illustrated in FIG. 10. It is denoted as c.
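As one possible reading of the edge feature in 2) above, a minimal NumPy sketch is given below; the gradient operator and the normalization used for the 16 intensity levels are assumptions, since the text only fixes the 4x4 grid, the 8 orientations and the 16-level quantization.

import numpy as np

def edge_feature(gray, grid=4, orientations=8, levels=16):
    """8-orientation edge histograms on a 4x4 grid (16 x 8 = 128 coefficients).

    gray: 2-D array, the pre-processed 256x256 image in grayscale.
    """
    gy, gx = np.gradient(gray.astype(float))
    magnitude = np.hypot(gx, gy)
    # Edge intensity uniformly quantized into 16 levels (one reading of FIG. 8).
    mag_q = np.floor(magnitude / (magnitude.max() + 1e-9) * (levels - 1))
    angle = np.mod(np.arctan2(gy, gx), np.pi)          # orientation in [0, pi)
    h, w = gray.shape
    feats = []
    for i in range(grid):
        for j in range(grid):
            sl = (slice(i * h // grid, (i + 1) * h // grid),
                  slice(j * w // grid, (j + 1) * w // grid))
            bins = (angle[sl] / np.pi * orientations).astype(int).clip(0, orientations - 1)
            hist = np.bincount(bins.ravel(), weights=mag_q[sl].ravel(),
                               minlength=orientations)  # contribution proportional to quantized intensity
            feats.append(hist)
    return np.concatenate(feats)                        # 16 grid cells x 8 orientations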

2. Templates — a set of representative (training) images with significant nudity. One embodiment includes images with male, female, standing, lying-down, breast, and butt.

• Perform feature extraction on each template image using the aforementioned methods.

3. Cross-filtering - Color frequency is used to find the likely clean images. Templates are used to find the likely pornographic images. These two filters are combined to give the final classification results.

1) Color frequency sorting (Non-pornography filtering)

• Using the 5 coefficients of the color frequency feature, an image is classified as clean if h₁ < 0.5 and h₂ < 0.5 and v₁ < 0.5 and v₂ < 0.5 and c < 0.5.

2) Template sorting (pornography filtering)

• The likeness measure between a template and an image is defined as likeness(T, I) = L₁(T_c − I_c) + L₁(T_e − I_e) + L₁(T_cp − I_cp) + L₁(T_ep − I_ep), where L₁ represents the L1 norm of a vector, T represents a template, I represents an image, and c, e, cp, ep respectively represent the color feature, edge feature, color profile and edge profile.

• As depicted in FIG. 11, each template is used to sort all images in the database based on the likeness measure. Multiple sorting lists are combined in a zigzag manner. Each image is only counted when it first appears in the zigzag order. Duplicate images are ignored. (A sketch of this zigzag merge follows this list.)

• After zigzag sorting, a threshold is used to decide if an image is likely to be pornography.

3) The cross-filtering process is illustrated in FIG. 12.
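A small Python sketch of the zigzag combination of per-template sorting lists used in 2) above; the list-of-image-ids representation and the rank cutoff are assumptions, since the text only states that a threshold is applied after zigzag sorting.

def zigzag_merge(sorted_lists):
    """Combine per-template sorting lists in a zigzag manner.

    sorted_lists: one list of image ids per template, best match first.
    Each image is counted only the first time it appears; duplicates are ignored.
    """
    merged, seen = [], set()
    longest = max(len(lst) for lst in sorted_lists)
    for rank in range(longest):
        for lst in sorted_lists:              # visit rank r of every template's list
            if rank < len(lst) and lst[rank] not in seen:
                seen.add(lst[rank])
                merged.append(lst[rank])
    return merged

def likely_pornography(merged_list, cutoff):
    """Images ranked above an assumed cutoff are flagged as likely pornography."""
    return set(merged_list[:cutoff])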

Methods for Video Classification

1. Scene cut

As each video frame is decoded, its color profile feature is calculated as described above. Starting from the second decoded frame, the color profile difference between two consecutive frames is also computed. Denote the frame at time t by I(t). Then the color profile difference is diff(t) = L₁(I_cp(t) − I_cp(t−1)). A frame k is flagged as a key frame if its color profile difference from the previous frame is greater than 3 times the average up until k−1: diff(k) > 3 · average(diff(1), ..., diff(k−1)). As shown in FIG. 13, the first frame is always chosen as a key frame. The difference value of any key frame is not counted towards the average. All frames starting at a key frame and ending before the next key frame are grouped into a scene.
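A minimal Python sketch of this scene-cut rule; the list-of-profiles input and the running-average bookkeeping are one reading of the text, which fixes only the L1 difference, the factor of 3, and the exclusion of key-frame differences from the average.

import numpy as np

def detect_key_frames(color_profiles):
    """Return indices of key frames given per-frame color profile vectors."""
    key_frames = [0]                          # the first frame is always a key frame
    non_key_diffs = []                        # key-frame diffs are excluded from the average
    for t in range(1, len(color_profiles)):
        diff = np.abs(color_profiles[t] - color_profiles[t - 1]).sum()   # L1 difference
        if non_key_diffs and diff > 3.0 * np.mean(non_key_diffs):
            key_frames.append(t)              # a new scene starts here
        else:
            non_key_diffs.append(diff)
    return key_frames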

2. Representative frame selection

For each scene, the representative frame R is selected by searching in a three-dimensional volume as depicted in FIG. 14.

The joint likeness measure of a frame is defined as the sum of the likeness measures from all templates: joint_likeness(I) = Σ_T {L₁(T_c − I_c) + L₁(T_e − I_e) + L₁(T_cp − I_cp) + L₁(T_ep − I_ep)}, where T goes through all templates. Then the representative frame is determined as R = median{joint_likeness(I)}, where I goes through all frames in a scene.
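A short sketch of the representative-frame selection, assuming a likeness(T, I) helper that returns the four-term L1 measure defined in the photo classification section (the helper itself is not shown and is an assumption here):

import numpy as np

def representative_frame(frames, templates, likeness):
    """Pick the frame whose joint likeness is the median over the scene."""
    scores = [sum(likeness(T, I) for T in templates) for I in frames]  # joint likeness per frame
    order = np.argsort(scores)
    return frames[order[len(order) // 2]]     # frame achieving the median score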

3. Scene sorting

We now end up with a two-dimensional space - the horizontal axis is spanned by the representative frames of a video, and the vertical axis is spanned by all the videos in the database. For each video, its representative frames, shown in FIG. 15, are sorted by the cross-filtering procedure introduced above, with the most likely violated frames in the front. Due to the use of the joint likeness, the zigzag process is not conducted.

4. Video sorting

As depicted in FIG. 16, videos are sorted based on the top representative frames identified in the previous sub-process, again by the same cross-filtering procedure.

System and Process for Content Classification

1. System

As shown in FIG. 17, an overall system embodiment 1700 comprises three parts: Processing Server 1710 - located at the customer site inside the customer company's firewall; Central Server 1720 - located at Company headquarters; and Inspection Server 1730 - located at the Company's inspection facility, which can be at a geographically remote location.

At the Processing Server 1710, the following are performed:

• Video and image decoding

• Image pre-processing as described in invention disclosure one

• Photo classification as described in invention disclosure two

• Video classification as described in invention disclosure three

• Immediate report to customer of clean content

• Communication to the Central server 1720 of content needing further inspection

Central Server 1720

• Communication to both Processing Server and Inspection server 1730 for the transmission of content

• Monitoring inspection process, second level QA and report generation

• Billing

Inspection Server 1730

• Inspector terminal management

• Provide content for inspection and receive tagging results

• Random insertion for process monitoring

• First level QA by inspection manager

• Communication back to Central server 1720 of tagging results

2. Process embodiments

Customers send photos and videos to the Processing server 1710 under a pre-agreed protocol, and receive classification reports in a pre-agreed format. Within the system, after automatic machine detection, clean content is immediately reported back to the customers. Remaining content is sent out for human inspection, which further tags the content into clean and dirty classes. Customer reports are updated accordingly. During the human inspection process, randomly generated violated pictures are sent to each inspector, and statistics are collected and sent to a central server 1720 for Quality Assurance purposes. If the missing rate thus detected is too high for a session, all the results from that session are invalidated and the content re-examined.

In some embodiments, there may be a machine learning element in the process. Human inspection results are fed back to the automatic machine detection tool to refine the internal mathematical model, as depicted in FIG. 18. Thus the detection accuracy is gradually improved.

3. Architecture

The Processing server 1710, the Central Server and the Inspection server 1730 communicate with one another via the standard internet interface called Web Services using the SOAP protocol, shown in FIG. 19. Content requiring human inspection is temporarily buffered at the Processing server 1710 until its inspection reports are received. One of the roles of the Central Server is to pass such content from the Processing server 1710 to the Inspection server 1730, collect tagging results from the latter, format them into inspection reports and pass them to the Processing server 1710.

The inspection process is monitored through the random insertion of violated pictures which are mixed with the actual content to be inspected. The Central Server calculates the statistics of each session. If the missing rate is too high, all the content from that session will be re-examined.

The Inspection server 1730 includes a web server which distributes the content to each inspector's terminal using the standard HTTP protocol. Tagging results are collected through the user interface at the terminals, packaged and sent back to the Central Server.

4. User Interface

The user interface comprises a display screen and an entry device, as shown in FIG. 20. The number of images shown on a screen is configurable. In some embodiments, a typical configuration has 6 images in a page. Each image is surrounded by a color border. The entry device has the same number of keys as there are images on the display. The location of each key and its color are consistent with those of the corresponding image. The purpose of the color coding is to provide the inspectors with visual cues when performing the data entry.

When a violated picture is seen on the screen, the correspondingly located and colored key may be pressed. The system responds to the inspector's selection by showing a thick diagonal line of the same color as the border. The selection can be toggled off in case the selection is accidental. When the Enter key is pressed, all selections of the current page are committed to the database if some images are indeed selected. The inspection automatically moves to the next page. The Back key on the entry device allows the inspector to go to a previously inspected page and make modifications if necessary. There is also an "End" key which, when pressed, ends a session.

A second embodiment of the current invention uses a game controller, depicted in FIG. 21, as the entry device. The keys at the right are mapped to the image locations on the screen. The joystick at the left is mapped to the Enter and Back keys. One of the keys in the middle is mapped to the End key.

5. The Quality Assurance (QA) Process

The QA process serves two purposes. First, to eliminate false detections, i.e., clean pictures wrongly identified as violated pictures. Second, to monitor the sessions and identify those where the missing rate is unusually high. The missing rate is obtained by counting how many randomly inserted pictures are missed.
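As a small illustration of this missing-rate bookkeeping (the 10% cutoff is an assumed example; the text gives no number):

def missing_rate(inserted_ids, flagged_ids):
    # Fraction of randomly inserted violated pictures the inspector missed.
    inserted = set(inserted_ids)
    missed = inserted - set(flagged_ids)
    return len(missed) / len(inserted) if inserted else 0.0

def session_valid(inserted_ids, flagged_ids, max_missing_rate=0.10):
    # Invalidate a session whose missing rate is unusually high.
    return missing_rate(inserted_ids, flagged_ids) <= max_missing_rate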

The previous description of the embodiments is provided to enable any person skilled in the art to practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without the use of inventive faculty. Thus, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.