Title:
MASKED BOUNDING-BOX SELECTION FOR TEXT ROTATION PREDICTION
Document Type and Number:
WIPO Patent Application WO/2024/076343
Kind Code:
A1
Abstract:
This application is directed to optical character recognition (OCR). An electronic device obtains an input image including textual content to be extracted from the input image and determines that the textual content has an input text orientation that is distinct from a target text orientation. A rotation angle is determined based on the input text orientation and target text orientation. The input image is rotated by the rotation angle to the target text orientation. The electronic device recognizes the textual content in the rotated input image having the target text orientation. In some embodiments, the electronic device determines one or more bounding boxes enclosing the textual content in the input image and crops the rotated input image according to the one or more bounding boxes to generate a textual content image. The textual content is recognized in the textual content image.

Inventors:
YANG YUEWEN (US)
LIN YUAN (US)
HO CHIU MAN (US)
Application Number:
PCT/US2022/045850
Publication Date:
April 11, 2024
Filing Date:
October 06, 2022
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06V10/00
Foreign References:
US20080101726A12008-05-01
US20040141645A12004-07-22
US20040114831A12004-06-17
US5235651A1993-08-10
US20150262007A12015-09-17
Attorney, Agent or Firm:
WANG, Jianbai et al. (US)
Claims:
What is claimed is:

1. An OCR enhancement method, comprising: obtaining an input image including textual content to be extracted from the input image; determining that the textual content has an input text orientation that is distinct from a target text orientation; determining a rotation angle based on the input text orientation and target text orientation; rotating the input image by the rotation angle to the target text orientation; and recognizing the textual content in the rotated input image having the target text orientation.

2. The method of claim 1, wherein recognizing the textual content in the rotated input image further comprises: identifying one or more bounding boxes enclosing the textual content in the input image; cropping the rotated input image according to the one or more bounding boxes to generate a textual content image including the one or more bounding boxes enclosing the textual content; and recognizing the textual content in the textual content image.

3. The method of claim 1 or 2, wherein determining that the textual content has the input text orientation further comprises: applying a backbone network to identify one or more bounding boxes that closely enclose one or more portions of the textual content and extract a plurality of feature maps from the input image, wherein the plurality of feature maps include one or more intermediate feature maps, and the backbone network is configured to control a background noise level, of the one or more intermediate feature maps, associated with an external portion of the input image that is external to the one or more bounding boxes.

4. The method of claim 1 or 2, further comprising: applying a backbone network to extract a plurality of feature maps from the input image; and applying a classification network to an output feature map in the plurality of feature maps to classify the input image to the input text orientation.

5. The method of claim 4, further comprising: applying the backbone network to identify one or more bounding boxes that closely enclose one or more portions of the textual content; and controlling a background noise level of the output feature map that is used by the classification network to classify the input image, wherein the reduced background noise level corresponds to an external portion of the input image that is external to the one or more bounding boxes.

6. The method of claim 4 or 5, wherein the classification network is applied for binary classification and configured to output one of the input text orientation and the target text orientation, and the input text orientation and the target text orientation are complementary to each other.

7. The method of any of claims 4-6, further comprising: determining a resource level of an electronic device; and selecting one of VGG-16, GhostNet, EfficientNet, and ResNet as the backbone network based on the resource level.

8. The method of any of claims 4-7, further comprising training the backbone network and classification network, including: obtaining a plurality of training images and a plurality of associated ground truths; classifying the plurality of training images using the backbone network and classification network; and adjusting the backbone network and classification network to match the classification results with the associated ground truths.

9. The method of any of claims 4-8, wherein the series of successive intermediate feature maps have resolutions that are scaled successively by a scaling factor.

10. The method of any of claims 4-7, wherein the plurality of feature maps include the series of successive intermediate feature maps, the method further comprising training the backbone network and classification network including: obtaining a plurality of training images; classifying the plurality of training images using the backbone network and classification network; and adjusting the backbone network and classification network based on a weighted combination of a classification loss and a feature loss of at least one of the series of successive intermediate feature maps of the training images.

11. The method of claim 10, wherein the weighted combination is represented by L = L_cls + λ1·L_MSE1 + λ2·L_MSE2, where L_cls is a binary classification loss, L_MSE1 and L_MSE2 are two mean squared error (MSE) losses for two of the series of successive intermediate feature maps that are generated from the training images by the backbone network, and λ1 and λ2 are weighting coefficients, the method further comprising: creating two intermediate feature ground truths by filling 1 to first regions of the two intermediate feature ground truths and filling 0 to remaining regions of the two intermediate feature ground truths, the first regions corresponding to one or more bounding boxes enclosing textual content in the training images.

12. The method of any of claims 1-11, wherein: the input text orientation corresponds to a first angular range of a planar coordinate system; the target text orientation corresponds to a second angular range of the planar coordinate system that is complementary to the first angular range, the input and target text orientations jointly covering an entire plane of the planar coordinate system; and in accordance with a determination that a textual orientation of the textual content is within the first angular range of the planar coordinate system, the textual content is determined to have the input text orientation.

13. The method of any of claims 1-12, wherein: the input text orientation corresponds to a vertical text orientation in a first angular range covering (-135°, -45°) and (45°, 135°) of a planar coordinate system; the target text orientation corresponds to a horizontal text orientation in a second angular range covering (-45°, 45°), (135°, 180°), and (-180°, -135°) of the planar coordinate system; and the input image is rotated clockwise by 90°.

14. The method of any of claims 1-11, wherein: the textual content is determined to have the input text orientation using a classification network, and the classification network is configured to output one of a plurality of text orientations including the input text orientation, the target text orientation, and at least one additional text orientation; each of the plurality of text orientations covers a non-overlapping angular range of a planar coordinate system; and the plurality of text orientations cover an entire plane of the planar coordinate system.

15. The method of any of claims 1-11, wherein determining the rotation angle further comprises selecting one of a plurality of predefined discrete angles as the rotation angle.

16. An electronic device, comprising: one or more processors; and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform a method of any of claims 1-15.

17. A non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the one or more processors to perform a method of any of claims 1-15.

Description:
Masked Bounding-Box Selection for Text Rotation Prediction

TECHNICAL FIELD

[0001] This application relates generally to optical character recognition (OCR) including, but not limited to, methods, systems, and non-transitory computer-readable media for converting textual content in an image to text based on an orientation of the image.

BACKGROUND

[0002] Optical character recognition (OCR) techniques automatically extract electronic data from printed or written text in a scanned document or an image file. The electronic data is converted into a machine-readable form for further data processing (e.g., editing and searching). Examples of the printed or written text that can be processed by OCR include receipts, contracts, invoices, financial statements, and the like. If implemented efficiently, OCR improves information accessibility for users. Some existing solutions estimate an orientation of the printed or written text based on a series of lines, each passing through two respective points. However, the number of lines can increase to a level that is beyond control and compromises the efficiency of these solutions. It would be beneficial to develop systems and methods for recognizing text in a scanned document or image file in an accurate and efficient manner.

SUMMARY

[0003] Various embodiments of this application are directed to methods, systems, devices, and non-transitory computer-readable media for determining an input text orientation of textual content in an input image and setting the input image to a target text orientation based on the input text orientation. The textual content is recognized in an OCR operation from the input image having the target text orientation. In an example, the input text orientation of the textual content is optionally a horizontal direction or a vertical direction. If the input text orientation of the input image is distinct from the target text orientation, the input image is rotated to the target text orientation. In some embodiments, the input text orientation of the input image is vertical, and the target text orientation is horizontal. Rotation of a vertical input image to a horizontal orientation significantly enhances an accuracy level of the subsequent OCR operation. More importantly, in some embodiments, one or more bounding boxes are identified in the input image, and a background noise level associated with the input image is controlled based on the one or more bounding boxes to enhance the accuracy level for determining the input text orientation of the textual content.

[0004] In one aspect, an OCR enhancement method is implemented at an electronic device. The method includes obtaining an input image including textual content to be extracted from the input image, determining that the textual content has an input text orientation that is distinct from a target text orientation, and determining a rotation angle based on the input text orientation and target text orientation. The method further includes rotating the input image by the rotation angle to the target text orientation and recognizing the textual content in the rotated input image having the target text orientation. In some embodiments, recognizing the textual content in the rotated input image includes identifying one or more bounding boxes enclosing the textual content in the input image, cropping the rotated input image according to the one or more bounding boxes to generate a textual content image including the one or more bounding boxes enclosing the textual content, and recognizing the textual content in the textual content image.
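By way of a non-limiting illustration only, the sketch below outlines this flow in Python. The orientation predictor, box detector, and recognizer are passed in as placeholder callables rather than components named by this disclosure, and the clockwise 90° rotation mirrors the vertical-to-horizontal example used elsewhere in this application.

```python
from typing import Callable, Sequence, Tuple
from PIL import Image

Box = Tuple[int, int, int, int]  # (left, upper, right, lower), PIL crop convention

def enhance_and_recognize(
    image: Image.Image,
    predict_orientation: Callable[[Image.Image], str],   # e.g., returns "vertical" or "horizontal"
    detect_boxes: Callable[[Image.Image], Sequence[Box]],
    recognize: Callable[[Image.Image], str],
    target: str = "horizontal",
) -> str:
    # Rotate only when the input text orientation differs from the target text orientation.
    if predict_orientation(image) != target:
        image = image.rotate(-90, expand=True)  # clockwise 90 degrees (illustrative rotation angle)
    # Crop the (possibly rotated) image according to the bounding boxes, then recognize each crop.
    crops = [image.crop(box) for box in detect_boxes(image)]
    return "\n".join(recognize(crop) for crop in crops)
```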

[0005] In some embodiments, the method further includes applying a backbone network to identify one or more bounding boxes that closely enclose one or more portions of the textual content and extract a plurality of feature maps from the input image. The plurality of feature maps include one or more intermediate feature maps, and the backbone network is configured to control a background noise level, of the one or more intermediate feature maps, associated with an external portion of the input image that is external to the one or more bounding boxes.

[0007] In some embodiments, the method further includes applying a backbone network to extract a plurality of feature maps from the input image and applying a classification network to an output feature map in the plurality of feature maps to classify the input image to the input text orientation. Further, in some embodiments, the method further includes applying the backbone network to identify one or more bounding boxes that closely enclose one or more portions of the textual content and controlling a background noise level of an output feature map that is used by the classification network to classify the input image. The reduced background noise level corresponds to an external portion of the input image that is external to the one or more bounding boxes.

[0008] In another aspect, some implementations include an electronic device that includes one or more processors and memory having instructions stored thereon, which when executed by the one or more processors cause the processors to perform any of the above methods.

[0009] In yet another aspect, some implementations include a non-transitory computer-readable medium, having instructions stored thereon, which when executed by one or more processors cause the processors to perform any of the above methods.

[0010] In various implementations of this application, a binary classification task is implemented to determine the input text orientation of the textual content in the input image. For example, a first value (e.g., “1”) corresponds to a vertical orientation, and a second value (e.g., “0”) corresponds to a horizontal orientation. In some embodiments, a backbone network includes a feature extractor (e.g., ResNet) coupled to a classification network, which acts as a classifier to provide two outputs corresponding to the vertical and horizontal orientations. In some embodiments, bounding boxes are identified in the input image and used as masks to remove background noise, thereby improving prediction results of the input text orientation. Specifically, in some embodiments, intermediate feature maps are generated by the ResNet and regularized to remove the background noise. Mean squared error (MSE) losses of these intermediate feature maps are used with a binary classification loss. In some embodiments, a first portion of ground truths of the intermediate feature maps corresponds to the bounding boxes and has the first value (e.g., “1”), and a second portion of the ground truths corresponds to a remaining portion of the input image and has the second value (e.g., “0”).

[0011] These illustrative embodiments and implementations are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof.
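As a non-limiting sketch of the binary classification task described in paragraph [0010], the following PyTorch code (assuming a recent torchvision) pairs a ResNet-18 feature extractor with a two-output classifier head; the specific ResNet variant and the label convention (index 0 for horizontal, index 1 for vertical) are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class OrientationClassifier(nn.Module):
    """ResNet feature extractor followed by a two-way classifier head
    (0 ~ horizontal, 1 ~ vertical, an assumed label convention)."""

    def __init__(self):
        super().__init__()
        backbone = resnet18(weights=None)
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # drop the final fc layer
        self.classifier = nn.Linear(backbone.fc.in_features, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.features(x).flatten(1)   # pooled output features
        return self.classifier(f)         # logits over {horizontal, vertical}

# Example: classify a batch of images resized to 224x224.
logits = OrientationClassifier()(torch.randn(2, 3, 224, 224))
is_vertical = logits.argmax(dim=1) == 1
```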

Additional embodiments are discussed in the Detailed Description, and further description is provided there.

BRIEF DESCRIPTION OF THE DRAWINGS

[0012] For a better understanding of the various described implementations, reference should be made to the Detailed Description below, in conjunction with the following drawings in which like reference numerals refer to corresponding parts throughout the figures.

[0013] Figure 1 is an example data processing environment having one or more servers communicatively coupled to one or more client devices, in accordance with some embodiments.

[0014] Figure 2 is a block diagram illustrating an electronic device configured to process content data (e.g., image data), in accordance with some embodiments.

[0015] Figure 3 is an example data processing environment for training and applying a neural network-based data processing model for processing visual and/or audio data, in accordance with some embodiments.

[0016] Figure 4A is an example neural network applied to process content data in an NN-based data processing model, in accordance with some embodiments, and Figure 4B is an example node in the neural network, in accordance with some embodiments.

[0017] Figures 5A and 5B are an example input image and a textual content image that includes bounding boxes closely enclosing textual content in the input image, in accordance with some embodiments.

[0018] Figure 6 is a flow diagram of an example process for recognizing textual content in an input image based on an input text orientation of the input image, in accordance with some embodiments.

[0019] Figure 7 is a block diagram of an example ResNet-based backbone network, in accordance with some embodiments.

[0020] Figure 8 is a flow diagram of an example image processing method for enhancing OCR based on an image orientation, in accordance with some embodiments.

[0021] Like reference numerals refer to corresponding parts throughout the several views of the drawings.

DETAILED DESCRIPTION

[0022] Reference will now be made in detail to specific embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to assist in understanding the subject matter presented herein. But it will be apparent to one of ordinary skill in the art that various alternatives may be used without departing from the scope of claims and the subject matter may be practiced without these specific details. For example, it will be apparent to one of ordinary skill in the art that the subject matter presented herein can be implemented on many types of electronic devices with digital video capabilities.

[0023] Various embodiments of this application are directed to methods, systems, devices, and non-transitory computer-readable media for determining an input text orientation of textual content in an input image and setting the input image to a target text orientation based on the input text orientation. The textual content is recognized in an OCR operation from the input image having the target text orientation. In an example, the input text orientation of the textual content is optionally a horizontal orientation (e.g., covering (-45°, 45°), (135°, 180°), and (-180°, -135°)) or a vertical orientation (e.g., covering (-135°, -45°) and (45°, 135°)). In some embodiments, the input text orientation of the input image is vertical, and the target text orientation is horizontal. Rotation of a vertical input image to a horizontal orientation significantly enhances an accuracy level of the subsequent OCR operation. More importantly, in some embodiments of this application, a background noise level associated with the input image is controlled to enhance the accuracy level for determining the input text orientation of the textual content.

[0024] In some embodiments, the input text orientation of textual content is determined via three operations including (1) feature extraction, (2) binary classification, and (3) intermediate feature regularization. In some embodiments, a backbone network includes a feature extractor (e.g., ResNet) and is coupled to a classification network. The backbone network extracts a plurality of feature maps from the input image, and the classification network acts as a classifier to provide two outputs corresponding to the vertical and horizontal orientations. In some embodiments, bounding boxes are identified in the input image and used as masks to control background noise, thereby improving prediction results of the input text orientation. Specifically, in some embodiments, intermediate feature maps are generated by the ResNet and regularized to remove the background noise. MSE losses of these intermediate feature maps are used with a binary classification loss. In some embodiments, a first portion of ground truths of the intermediate feature maps corresponds to the bounding boxes and has a first value (e.g., “1”), and a second portion of the ground truths corresponds to a remaining portion of the input image and has a second value (e.g., “0”).
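The bounding-box masks described above can be turned into intermediate-feature ground truths as in the sketch below, which fills 1 inside the boxes and 0 elsewhere and then resamples the mask to the feature-map resolution. The (x0, y0, x1, y1) box format and nearest-neighbour resampling are illustrative assumptions.

```python
import numpy as np

def box_mask_ground_truth(image_hw, boxes, feature_hw):
    """Intermediate-feature ground truth: 1 inside the text bounding boxes,
    0 in the remaining (background) regions, resampled to the feature-map size.
    Boxes are (x0, y0, x1, y1) in image pixels; this helper is illustrative."""
    h, w = image_hw
    mask = np.zeros((h, w), dtype=np.float32)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = 1.0            # first regions: inside the bounding boxes
    fh, fw = feature_hw
    ys = np.arange(fh) * h // fh            # nearest-neighbour row indices
    xs = np.arange(fw) * w // fw            # nearest-neighbour column indices
    return mask[np.ix_(ys, xs)]

# Example: one text box in a 480x640 image, mapped to a 30x40 feature map.
gt = box_mask_ground_truth((480, 640), [(40, 60, 600, 120)], (30, 40))
```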

[0025] Figure 1 is an example data processing environment 100 having one or more servers 102 communicatively coupled to one or more client devices 104, in accordance with some embodiments. The one or more client devices 104 may be, for example, desktop computers 104A, tablet computers 104B, mobile phones 104C, a head-mounted display (HMD) (also called augmented reality (AR) glasses) 104D, or intelligent, multi-sensing, network-connected home devices (e.g., a surveillance camera 104E, a smart television device, a drone). Each client device 104 can collect data or user inputs, execute user applications, and present outputs on its user interface. The collected data or user inputs can be processed locally at the client device 104 and/or remotely by the server(s) 102. The one or more servers 102 provide system data (e.g., boot files, operating system images, and user applications) to the client devices 104, and in some embodiments, process the data and user inputs received from the client device(s) 104 when the user applications are executed on the client devices 104. In some embodiments, the data processing environment 100 further includes a storage 106 for storing data related to the servers 102, client devices 104, and applications executed on the client devices 104.

[0026] The one or more servers 102 are configured to enable real-time data communication with the client devices 104 that are remote from each other or from the one or more servers 102. Further, in some embodiments, the one or more servers 102 are configured to implement data processing tasks that cannot be or are preferably not completed locally by the client devices 104. For example, the client devices 104 include a game console (e.g., the HMD 104D) that executes an interactive online gaming application. The game console receives a user instruction and sends it to a game server 102 with user data. The game server 102 generates a stream of video data based on the user instruction and user data and provides the stream of video data for display on the game console and other client devices that are engaged in the same game session with the game console. In another example, the client devices 104 include a networked surveillance camera 104E and a mobile phone 104C. The networked surveillance camera 104E collects video data and streams the video data to a surveillance camera server 102 in real time. While the video data is optionally pre-processed on the surveillance camera 104E, the surveillance camera server 102 processes the video data to identify motion or audio events in the video data and share information of these events with the mobile phone 104C, thereby allowing a user of the mobile phone 104C to monitor the events occurring near the networked surveillance camera 104E in real time and remotely.

[0027] The one or more servers 102, one or more client devices 104, and storage 106 are communicatively coupled to each other via one or more communication networks 108, which are the medium used to provide communications links between these devices and computers connected together within the data processing environment 100. The one or more communication networks 108 may include connections, such as wire, wireless communication links, or fiber optic cables. Examples of the one or more communication networks 108 include local area networks (LAN), wide area networks (WAN) such as the Internet, or a combination thereof. The one or more communication networks 108 are, optionally, implemented using any known network protocol, including various wired or wireless protocols, such as Ethernet, Universal Serial Bus (USB), FIREWIRE, Long Term Evolution (LTE), Global System for Mobile Communications (GSM), Enhanced Data GSM Environment (EDGE), code division multiple access (CDMA), time division multiple access (TDMA), Bluetooth, Wi-Fi, voice over Internet Protocol (VoIP), Wi-MAX, or any other suitable communication protocol. A connection to the one or more communication networks 108 may be established either directly (e.g., using 3G/4G connectivity to a wireless carrier), or through a network interface 110 (e.g., a router, switch, gateway, hub, or an intelligent, dedicated whole-home control node), or through any combination thereof. As such, the one or more communication networks 108 can represent the Internet of a worldwide collection of networks and gateways that use the Transmission Control Protocol/Internet Protocol (TCP/IP) suite of protocols to communicate with one another. At the heart of the Internet is a backbone of high-speed data communication lines between major nodes or host computers, consisting of thousands of commercial, governmental, educational and other computer systems that route data and messages.

[0028] In some embodiments, deep learning techniques are applied in the data processing environment 100 to process content data (e.g., video data, visual data, audio data) obtained by an application executed at a client device 104 to identify information contained in the content data, match the content data with other data, categorize the content data, or synthesize related content data. In these deep learning techniques, data processing models are created based on one or more neural networks to process the content data. These data processing models are trained with training data before they are applied to process the content data. Subsequently to model training, the client device 104 obtains the content data (e.g., captures video data via an internal camera) and processes the content data using the data processing models locally.

[0029] In some embodiments, both model training and data processing are implemented locally at each individual client device 104. The client device 104 obtains the training data from the one or more servers 102 or storage 106 and applies the training data to train the data processing models. Alternatively, in some embodiments, both model training and data processing are implemented remotely at a server 102 (e.g., the server 102A) associated with a client device 104. The server 102A obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models. The client device 104 obtains the content data, sends the content data to the server 102A (e.g., in an application) for data processing using the trained data processing models, receives data processing results (e.g., recognized hand gestures) from the server 102A, presents the results on a user interface (e.g., associated with the application), renders virtual objects in a field of view based on the poses, or implements some other functions based on the results. The client device 104 itself implements no or little data processing on the content data prior to sending them to the server 102A. Additionally, in some embodiments, data processing is implemented locally at a client device 104, while model training is implemented remotely at a server 102 (e.g., the server 102B) associated with the client device 104. The server 102B obtains the training data from itself, another server 102, or the storage 106, and applies the training data to train the data processing models. The trained data processing models are optionally stored in the server 102B or storage 106. The client device 104 imports the trained data processing models from the server 102B or storage 106, processes the content data using the data processing models, and generates data processing results to be presented on a user interface or used to initiate some functions (e.g., rendering virtual objects based on device poses) locally.

[0030] In some embodiments, a pair of AR glasses 104D (also called an HMD) are communicatively coupled in the data processing environment 100. The HMD 104D includes a camera, a microphone, a speaker, one or more inertial sensors (e.g., gyroscope, accelerometer), and a display. The camera and microphone are configured to capture video and audio data from a scene of the HMD 104D, while the one or more inertial sensors are configured to capture inertial sensor data. In some situations, the camera captures hand gestures of a user wearing the HMD 104D, and recognizes the hand gestures locally and in real time using a two-stage hand gesture recognition model. In some situations, the microphone records ambient sound, including user’s voice commands. In some situations, both video or static visual data captured by the camera and the inertial sensor data measured by the one or more inertial sensors are applied to determine and predict device poses. The video, static image, audio, or inertial sensor data captured by the HMD 104D is processed by the HMD 104D, server(s) 102, or both to recognize the device poses. Optionally, deep learning techniques are applied by the server(s) 102 and HMD 104D jointly to recognize and predict the device poses. The device poses are used to control the HMD 104D itself or interact with an application (e.g., a gaming application) executed by the HMD 104D. In some embodiments, the display of the HMD 104D displays a user interface, and the recognized or predicted device poses are used to render or interact with user selectable display items (e.g., an avatar) on the user interface.

[0031] Figure 2 is a block diagram illustrating an electronic system 200 configured to process content data (e.g., image data), in accordance with some embodiments. The electronic system 200 includes a server 102, a client device 104 (e.g., HMD 104D in Figure 1), a storage 106, or a combination thereof. The electronic system 200, typically, includes one or more processing units (CPUs) 202, one or more network interfaces 204, memory 206, and one or more communication buses 208 for interconnecting these components (sometimes called a chipset). The electronic system 200 includes one or more input devices 210 that facilitate user input, such as a keyboard, a mouse, a voice-command input unit or microphone, a touch screen display, a touch-sensitive input pad, a gesture capturing camera, or other input buttons or controls. Furthermore, in some embodiments, the client device 104 of the electronic system 200 uses a microphone for voice recognition or a camera 260 for gesture recognition to supplement or replace the keyboard. In some embodiments, the client device 104 includes one or more optical cameras 260 (e.g., an RGB camera), scanners, or photo sensor units for capturing images, for example, of graphic serial codes printed on the electronic devices. The electronic system 200 also includes one or more output devices 212 that enable presentation of user interfaces and display content, including one or more speakers and/or one or more visual displays. Optionally, the client device 104 includes a location detection device, such as a GPS (global positioning system) or other geo-location receiver, for determining the location of the client device 104.

[0032] Memory 206 includes high-speed random access memory, such as DRAM, SRAM, DDR RAM, or other random access solid state memory devices; and, optionally, includes non-volatile memory, such as one or more magnetic disk storage devices, one or more optical disk storage devices, one or more flash memory devices, or one or more other non-volatile solid state storage devices. Memory 206, optionally, includes one or more storage devices remotely located from one or more processing units 202. Memory 206, or alternatively the non-volatile memory within memory 206, includes a non-transitory computer readable storage medium. In some embodiments, memory 206, or the non-transitory computer readable storage medium of memory 206, stores the following programs, modules, and data structures, or a subset or superset thereof:

• Operating system 214 including procedures for handling various basic system services and for performing hardware dependent tasks;

• Network communication module 216 for connecting each server 102 or client device 104 to other devices (e.g., server 102, client device 104, or storage 106) via one or more network interfaces 204 (wired or wireless) and one or more communication networks 108, such as the Internet, other wide area networks, local area networks, metropolitan area networks, and so on;

• User interface module 218 for enabling presentation of information (e.g., a graphical user interface for application(s) 224, widgets, websites and web pages thereof, and/or games, audio and/or video content, text, etc.) at each client device 104 via one or more output devices 212 (e.g., displays, speakers, etc.);

• Input processing module 220 for detecting one or more user inputs or interactions from one of the one or more input devices 210 and interpreting the detected input or interaction;

• Web browser module 222 for navigating, requesting (e.g., via HTTP), and displaying websites and web pages thereof, including a web interface for logging into a user account associated with a client device 104 or another electronic device, controlling the client or electronic device if associated with the user account, and editing and reviewing settings and data that are associated with the user account;

• One or more user applications 224 for execution by the electronic system 200 (e.g., games, social network applications, smart home applications, and/or other web or non-web based applications for controlling another electronic device and reviewing data captured by such devices), where in some embodiments, the user application(s) 224 include an OCR application for recognizing textual content in an image;

• Model training module 226 for receiving training data and establishing a data processing model for processing content data (e.g., video, image, audio, or textual data) to be collected or obtained by a client device 104;

• Data processing module 228 for processing content data using data processing models 240, thereby identifying information contained in the content data, matching the content data with other data, categorizing the content data, or synthesizing related content data, where in some embodiments, the data processing module 228 is associated with one of the user applications 224 (e.g., an OCR application) and configured to determine an input text orientation of an image and recognize textual content in an image having or rotated to a target text orientation;

• One or more databases 230 for storing at least data including one or more of:
o Device settings 232 including common device settings (e.g., service tier, device model, storage capacity, processing capabilities, communication capabilities, etc.) of the one or more servers 102 or client devices 104;
o User account information 234 for the one or more user applications 224, e.g., user names, security questions, account history data, user preferences, and predefined account settings;
o Network parameters 236 for the one or more communication networks 108, e.g., IP address, subnet mask, default gateway, DNS server, and host name;
o Training data 238 for training one or more data processing models 240;
o Data processing model(s) 240 for processing content data (e.g., video, image, audio, or textual data) using deep learning techniques, where the data processing models 240 include a machine learning model for determining an input text orientation of an image; and
o Content data and results 242 that are obtained by and outputted to the client device 104 of the electronic system 200, respectively, where the content data is processed by the data processing models 240 locally at the client device 104 or remotely at the server 102 to provide the associated results to be presented on the client device 104.

[0034] Optionally, the one or more databases 230 are stored in one of the server 102, client device 104, and storage 106 of the electronic system 200. Optionally, the one or more databases 230 are distributed in more than one of the server 102, client device 104, and storage 106 of the electronic system 200. In some embodiments, more than one copy of the above data is stored at distinct devices, e.g., two copies of the data processing models 240 are stored at the server 102 and storage 106, respectively.

[0034] Each of the above identified elements may be stored in one or more of the previously mentioned memory devices, and corresponds to a set of instructions for performing a function described above. The above identified modules or programs (i.e., sets of instructions) need not be implemented as separate software programs, procedures, modules or data structures, and thus various subsets of these modules may be combined or otherwise re-arranged in various embodiments. In some embodiments, memory 206, optionally, stores a subset of the modules and data structures identified above. Furthermore, memory 206, optionally, stores additional modules and data structures not described above.

[0035] Figure 3 is another example of a data processing system 300 for training and applying a neural network based (NN-based) data processing model 240 for processing content data (e.g., video, image, audio, or textual data), in accordance with some embodiments. The data processing system 300 includes a model training module 226 for establishing the data processing model 240 and a data processing module 228 for processing the content data using the data processing model 240. In some embodiments, both of the model training module 226 and the data processing module 228 are located on a client device 104 of the data processing system 300, while a training data source 304 distinct from the client device 104 provides training data 306 to the client device 104. The training data source 304 is optionally a server 102 or storage 106. Alternatively, in some embodiments, the model training module 226 and the data processing module 228 are both located on a server 102 of the data processing system 300. The training data source 304 providing the training data 306 is optionally the server 102 itself, another server 102, or the storage 106. Additionally, in some embodiments, the model training module 226 and the data processing module 228 are separately located on a server 102 and client device 104, and the server 102 provides the trained data processing model 240 to the client device 104.

[0036] The model training module 226 includes one or more data pre-processing modules 308, a model training engine 310, and a loss control module 312. The data processing model 240 is trained according to the type of content data to be processed. The training data 306 is consistent with the type of the content data, and so is the data pre-processing module 308 that is applied to process the training data 306. For example, an image pre-processing module 308A is configured to process image training data 306 to a predefined image format, e.g., extract a region of interest (ROI) in each training image, and crop each training image to a predefined image size. Alternatively, an audio pre-processing module 308B is configured to process audio training data 306 to a predefined audio format, e.g., converting each training sequence to a frequency domain using a Fourier transform. The model training engine 310 receives pre-processed training data provided by the data pre-processing modules 308, further processes the pre-processed training data using an existing data processing model 240, and generates an output from each training data item. During this course, the loss control module 312 can monitor a loss function comparing the output associated with the respective training data item and a ground truth of the respective training data item. The model training engine 310 modifies the data processing model 240 to reduce the loss function, until the loss function satisfies a loss criterion (e.g., a comparison result of the loss function is minimized or reduced below a loss threshold). The modified data processing model 240 is provided to the data processing module 228 to process the content data.
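A minimal sketch of such a training engine with a loss-control criterion is shown below; the Adam optimizer, learning rate, and the specific stopping rule (mean epoch loss below a threshold) are assumptions for illustration, not requirements of the loss control module 312.

```python
import torch

def train_until_converged(model, loader, loss_fn, loss_threshold=0.05, max_epochs=50):
    """Illustrative training engine: modify the model to reduce the loss until a
    simple loss criterion is satisfied (mean epoch loss below loss_threshold)."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(max_epochs):
        total, count = 0.0, 0
        for inputs, targets in loader:
            opt.zero_grad()
            loss = loss_fn(model(inputs), targets)  # compare output with ground truth
            loss.backward()                         # adjust weights to reduce the loss
            opt.step()
            total += loss.item()
            count += 1
        if total / max(count, 1) < loss_threshold:  # loss criterion satisfied
            break
    return model
```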

[0037] In some embodiments, the model training module 226 offers supervised learning in which the training data is entirely labelled and includes a desired output for each training data item (also called the ground truth in some situations). Conversely, in some embodiments, the model training module 226 offers unsupervised learning in which the training data are not labelled. The model training module 226 is configured to identify previously undetected patterns in the training data without pre-existing labels and with no or little human supervision. Additionally, in some embodiments, the model training module 226 offers partially supervised learning in which the training data are partially labelled.

[0038] The data processing module 228 includes a data pre-processing module 314, a model-based processing module 316, and a data post-processing module 318. The data pre-processing module 314 pre-processes the content data based on the type of the content data. Functions of the data pre-processing module 314 are consistent with those of the pre-processing modules 308 and convert the content data to a predefined content format that is acceptable by inputs of the model-based processing module 316. Examples of the content data include one or more of the following: video, image, audio, textual, and other types of data. For example, each image is pre-processed to extract an ROI or cropped to a predefined image size, and an audio clip is pre-processed to convert to a frequency domain using a Fourier transform. In some situations, the content data includes two or more types, e.g., video data and textual data. The model-based processing module 316 applies the trained data processing model 240 provided by the model training module 226 to process the pre-processed content data. The model-based processing module 316 can also monitor an error indicator to determine whether the content data has been properly processed in the data processing module 228. In some embodiments, the processed content data is further processed by the data post-processing module 318 to present the processed content data in a preferred format or to provide other related information that can be derived from the processed content data.
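For example, the pre-processing described above might be sketched as follows; the crop size of 224x224, the normalization to [0, 1], and the use of a magnitude spectrum are illustrative choices.

```python
import numpy as np
from PIL import Image

def preprocess_image(img: Image.Image, roi, size=(224, 224)) -> np.ndarray:
    """Crop a region of interest (left, upper, right, lower) and resize the crop
    to a predefined image size; values are scaled to [0, 1] for illustration."""
    return np.asarray(img.crop(roi).resize(size), dtype=np.float32) / 255.0

def preprocess_audio(samples: np.ndarray) -> np.ndarray:
    """Convert a training sequence to the frequency domain with a Fourier transform
    (magnitude spectrum shown here as one possible representation)."""
    return np.abs(np.fft.rfft(samples))
```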

[0039] Figure 4A is an exemplary neural network (NN) 400 applied to process content data in an NN-based data processing model 240, in accordance with some embodiments, and Figure 4B is an example of a node 420 in the neural network (NN) 400, in accordance with some embodiments. The data processing model 240 is established based on the neural network 400. A corresponding model-based processing module 316 applies the data processing model 240 including the neural network 400 to process content data that has been converted to a predefined content format. The neural network 400 includes a collection of nodes 420 that are connected by links 412. Each node 420 receives one or more node inputs and applies a propagation function to generate a node output from the node input(s). As the node output is provided via one or more links 412 to one or more other nodes 420, a weight w associated with each link 412 is applied to the node output. Likewise, the node input(s) can be combined based on corresponding weights w1, w2, w3, and w4 according to the propagation function. For example, the propagation function is a product of a non-linear activation function and a linear weighted combination of the node input(s).

[0040] The collection of nodes 420 is organized into one or more layers in the neural network 400. Optionally, the layer(s) may include a single layer acting as both an input layer and an output layer. Optionally, the layer(s) may include an input layer 402 for receiving inputs, an output layer 406 for providing outputs, and zero or more hidden layers 404 (e.g., 404A and 404B) between the input and output layers 402 and 406. A deep neural network has more than one hidden layer 404 between the input and output layers 402 and 406. In the neural network 400, each layer is only connected with its immediately preceding and/or immediately following layer. In some embodiments, a layer 402 or 404B is a fully connected layer because each node 420 in the layer 402 or 404B is connected to every node 420 in its immediately following layer. In some embodiments, one of the hidden layers 404 includes two or more nodes that are connected to the same node in its immediately following layer for down sampling or pooling the nodes 420 between these two layers. Particularly, max pooling uses a maximum value of the two or more nodes in the layer 404B for generating the node of the immediately following layer 406 connected to the two or more nodes.
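A single node 420 with the propagation function described in paragraph [0039], together with the max pooling described above, can be sketched as follows; ReLU is chosen as the non-linear activation purely for illustration, and the bias term follows the description in paragraph [0043].

```python
import numpy as np

def node_output(inputs, weights, bias=0.0):
    """Propagation function of one node 420: a non-linear activation (ReLU here)
    applied to a linear weighted combination of the node inputs plus a bias term."""
    z = float(np.dot(weights, inputs)) + bias
    return max(z, 0.0)

def max_pool(node_outputs):
    """Max pooling: the node in the following layer takes the maximum value of
    the two or more connected nodes in the preceding layer."""
    return max(node_outputs)

# Example: one node with three inputs, then pooling over two node outputs.
y1 = node_output([0.2, -1.0, 0.5], [0.4, 0.1, -0.3], bias=0.05)
y2 = node_output([0.9, 0.3, -0.2], [0.2, -0.5, 0.1], bias=0.05)
pooled = max_pool([y1, y2])
```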

[0041] In some embodiments, a convolutional neural network (CNN) is applied in a data processing model 240 to process content data (particularly, video and image data). The CNN employs convolution operations and belongs to a class of deep neural networks 400, i.e., a feedforward neural network that only moves data forward from the input layer 402 through the hidden layers to the output layer 406. The hidden layer(s) of the CNN can be convolutional layers convolving with multiplication or dot product. Each node in a convolutional layer receives inputs from a receptive area associated with a previous layer (e.g., five nodes), and the receptive area is smaller than the entire previous layer and may vary based on a location of the convolution layer in the convolutional neural network. Video or image data is pre-processed to a predefined video/image format corresponding to the inputs of the CNN. The pre-processed video or image data is abstracted by each layer of the CNN to a respective feature map. By these means, video and image data can be processed by the CNN for video and image recognition, classification, analysis, imprinting, or synthesis.

[0042] Alternatively and additionally, in some embodiments, a recurrent neural network (RNN) is applied in the data processing model 240 to process content data (particularly, textual and audio data). Nodes in successive layers of the RNN follow a temporal sequence, such that the RNN exhibits a temporal dynamic behavior. For example, each node 420 of the RNN has a time-varying real-valued activation. Examples of the RNN include, but are not limited to, a long short-term memory (LSTM) network, a fully recurrent network, an Elman network, a Jordan network, a Hopfield network, a bidirectional associative memory (BAM) network, an echo state network, an independently recurrent neural network (IndRNN), a recursive neural network, and a neural history compressor. In some embodiments, the RNN can be used for handwriting or speech recognition. It is noted that in some embodiments, two or more types of content data are processed by the data processing module 228, and two or more types of neural networks (e.g., both CNN and RNN) are applied to process the content data jointly.

[0043] The training process is a process for calibrating all of the weights w for each layer of the learning model using a training data set which is provided in the input layer 402. The training process typically includes two steps, forward propagation and backward propagation, which are repeated multiple times until a predefined convergence condition is satisfied. In the forward propagation, the set of weights for different layers are applied to the input data and intermediate results from the previous layers. In the backward propagation, a margin of error of the output (e.g., a loss function) is measured, and the weights are adjusted accordingly to decrease the error. The activation function is optionally linear, rectified linear unit, sigmoid, hyperbolic tangent, or of other types. In some embodiments, a network bias term b is added to the sum of the weighted outputs from the previous layer before the activation function is applied. The network bias b provides a perturbation that helps the NN 400 avoid overfitting the training data. The result of the training includes the network bias parameter b for each layer.

[0044] Figures 5A and 5B are an example input image 500 and a textual content image 550 corresponding to the input image 500, in accordance with some embodiments. The textual content image 550 includes one or more bounding boxes (e.g., 612 in Figure 6) closely enclosing textual content 502 in the input image 500. An external portion 550A of the textual content image 550 is external to the one or more bounding boxes, and background noise in the external portion 550A of the textual content image 550 is reduced (e.g., below a threshold noise level) or entirely suppressed. An electronic device obtains the input image 500 and determines that the textual content 502 has an input text orientation that is distinct from a target text orientation. The electronic device determines a rotation angle based on the input text orientation and target text orientation. The input image 500 is rotated by the rotation angle to the target text orientation in which the rotated input image 500 is used to recognize the textual content. In an example, the input text orientation is vertical, and the target text orientation is horizontal. The electronic device determines that a rotation angle is 90° clockwise, and the input image 500 is rotated by 90° clockwise. In some embodiments, the background noise in the external portion 550A of the textual content image 550 is controlled, thereby enhancing an accuracy level of the input text orientation determined for the input image 500.

[0045] Specifically, in some embodiments, a backbone network (e.g., 608 in Figure 6) is applied to identify the one or more bounding boxes and extract a plurality of feature maps applied to determine the input text orientation of the textual content in the input image 500. The one or more bounding boxes are applied to define one or more intermediate feature ground truths. During training, the one or more intermediate feature ground truths are applied to train the backbone network, such that the plurality of extracted feature maps inferred by the backbone network focus on the one or more bounding boxes and correspond to the external portion 550A in which the background noise is controlled. Further, in some embodiments, each of the one or more intermediate feature ground truths has non-zero values in corresponding bounding boxes and zero values in the corresponding external portion 550A. More details on regularization of the one or more intermediate feature maps are explained below with reference to Figure 6.

[0046] In some embodiments, the backbone network is based on ResNet. Alternatively, in some embodiments, the backbone network is based on a CNN. For example, the backbone network includes one of VGG-16, GhostNet, and EfficientNet. In some embodiments, a server 102 determines a resource level of the electronic device, and selects one of VGG-16, GhostNet, EfficientNet, and ResNet as the backbone network based on the resource level. In some embodiments, the resource level indicates a level of computational and/or storage resources of the electronic device. The server 102 trains the selected backbone network and provides the trained backbone network to the electronic device after training. The electronic device then applies the backbone network to determine the input text orientation.
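A resource-based selection might be expressed as in the sketch below; the particular mapping from resource levels to networks is an assumption, since the application only states that one of these backbones is selected based on the resource level of the electronic device.

```python
def select_backbone(resource_level: str) -> str:
    """Choose a backbone by device resource level. The mapping is illustrative;
    any assignment of {VGG-16, GhostNet, EfficientNet, ResNet} to levels is possible."""
    mapping = {
        "low": "GhostNet",         # lightweight backbone for constrained devices
        "medium": "EfficientNet",
        "high": "ResNet",
        "server": "VGG-16",
    }
    return mapping.get(resource_level, "ResNet")

backbone_name = select_backbone("low")
```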

[0047] Based on the input text orientation, the electronic device determines whether to rotate the textual content and recognizes the textual content 502 in the input image 500. In accordance with a determination that the input text orientation of the input image 500 is consistent with a target text orientation, the electronic device keeps the input text orientation and recognizes the textual content 502 in the input image 500 or in the one or more bounding boxes of the input image 500. Alternatively, in accordance with a determination that the input text orientation of the input image 500 is not consistent with the target text orientation, the electronic device rotates the input image 500 by a rotation angle and recognizes the textual content 502 in the rotated input image or in the one or more bounding boxes of the rotated input image.

[0048] Figure 6 is a flow diagram of an example process 600 for recognizing textual content 502 in an input image 500 based on an input text orientation 602 of the input image 500, in accordance with some embodiments. The process 600 is implemented at an electronic system 200 that includes a server 102, a client device 104, or a combination thereof, e.g., by a data processing module 228 of the client device 104. The electronic system 200 obtains an input image 500 including the textual content 502 to be extracted from the input image 500, and determines the input text orientation 602 of the input image 500. The input text orientation 602 of the input image 500 is compared with a target text orientation. If the input text orientation 602 is distinct from the target text orientation, the electronic system 200 determines a rotation angle (e.g., 90°) based on the input text orientation and target text orientation, and rotates (604) the input image 500 by the rotation angle to the target text orientation. The textual content 502 is recognized (606) in the rotated input image 500 having the target text orientation. Conversely, if the input text orientation 602 is identical to the target text orientation, the electronic system 200 does not rotate the input image 500, and the textual content 502 is recognized (606) in the input image 500 that has the input text orientation 602 matching the target text orientation. In an example, the target text orientation is a horizontal orientation, and the textual content 502 of the input image 500 having or rotated to the horizontal orientation is recognized with a desirable accuracy level. In another example, the input text orientation (e.g., a vertical orientation 636) and the target text orientation (e.g., a horizontal orientation 638) are complementary to each other.

[0049] The textual content 502 is parallel to a textual line 632, and a corresponding text orientation is measured from a location of the textual line 632 in a planar coordinate system 634 after the textual line 632 is shifted to pass an origin of the planar coordinate system 634. In some embodiments, the input text orientation 602 corresponds to a first angular range 636 of the planar coordinate system 634, and the target text orientation corresponds to a second angular range 638 of the planar coordinate system 634 that is complementary to the first angular range 636. The input and target text orientations jointly cover an entire plane of the planar coordinate system 634. In accordance with a determination that a textual orientation of the textual content 502 is within the first angular range 636 of the planar coordinate system 634, the textual content 502 is determined to have the input text orientation 602 that is distinct from the target text orientation. In some embodiments, the input text orientation 602 corresponds to a vertical text orientation in a first angular range 636 covering (-135°, -45°) and (45°, 135°) of a planar coordinate system 634. The target text orientation corresponds to a horizontal text orientation in a second angular range 638 covering (-45°, 45°), (135°, 180°), and (-180°, -135°) of the planar coordinate system 634. In some embodiments, the input image is rotated clockwise by 90°. Alternatively, in some embodiments, the input image is rotated counter-clockwise by 90°.
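The two complementary angular ranges can be checked with a small helper such as the one below; how boundary angles (e.g., exactly 45°) are assigned is not specified by the open ranges above and is an assumption here.

```python
def text_orientation(angle_deg: float) -> str:
    """Map a textual-line angle (degrees in the planar coordinate system 634) to
    the two complementary orientations: vertical covers (-135, -45) and (45, 135);
    horizontal covers the rest of the plane. Boundary handling is an assumption."""
    a = ((angle_deg + 180.0) % 360.0) - 180.0   # normalize to [-180, 180)
    return "vertical" if 45.0 < abs(a) < 135.0 else "horizontal"

assert text_orientation(90) == "vertical"      # within (45, 135)
assert text_orientation(170) == "horizontal"   # within (135, 180)
```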

[0050] In some embodiments, the electronic system 200 includes a backbone network 608 and a classification network 610 coupled to the backbone network 608. The backbone network 608 is applied to identify one or more bounding boxes 612 that closely enclose one or more portions of the textual content 502 and extract feature maps 616A-616E from the input image 500. In some embodiments, the classification network 610 is applied for binary classification and configured to output one of the input text orientation and the target text orientation. An example of the classification network 610 is a softmax layer configured to convert an output feature map 616E to a vector 602 having a plurality of elements, and each element of the vector 602 indicates a probability of a textual orientation of the textual content 502 being a respective one of a plurality of predefined text orientations.

[0051] In some embodiments, a bounding box 612 includes more than one line of textual content 502. In some embodiments, a bounding box 612 identified by the backbone network 608 includes a single line of textual content 502. In some embodiments, a single line of textual content 502 includes two portions enclosed by two separate bounding boxes 612. In some embodiments, each bounding box 612 is identified as closely enclosing a portion of the textual content 502, and a smallest distance from each pixel of an edge of the bounding box 612 to the portion of the textual content 502 enclosed in the bounding box 612 is less than a threshold space. In some embodiments, the one or more bounding boxes 612 are also used to recognize the textual content 502 in the input image 500. The input image 500 having or rotated to the target text orientation is cropped (614) according to the one or more bounding boxes 612 to generate a textual content image 615 including the one or more bounding boxes 612 enclosing the textual content. The textual content 502 is recognized (606) in the textual content image 615.
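Because the bounding boxes are identified on the input image while the cropping is applied to the rotated image, one way to implement the cropping step is to remap the box corners under the rotation before slicing, as sketched below for a 90° clockwise rotation; the (x0, y0, x1, y1) box convention with exclusive upper bounds is an assumption of this sketch.

```python
import numpy as np

def rotate_box_cw90(box, orig_h):
    """Remap an axis-aligned box (x0, y0, x1, y1) from the input image into the
    coordinate frame of the image after a 90-degree clockwise rotation."""
    x0, y0, x1, y1 = box
    # Under a clockwise 90-degree rotation of an image of height orig_h,
    # a pixel (x, y) moves to (orig_h - 1 - y, x).
    return orig_h - y1, x0, orig_h - y0, x1

def crop_text_regions(rotated: np.ndarray, boxes, orig_h):
    """Crop the rotated image according to the remapped bounding boxes."""
    crops = []
    for box in boxes:
        x0, y0, x1, y1 = rotate_box_cw90(box, orig_h)
        crops.append(rotated[y0:y1, x0:x1])
    return crops
```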

[0052] In some embodiments, a subset of the extracted feature maps 616A-616E include one or more intermediate feature maps 618 (e.g., 618A-618E). A background noise level of the one or more intermediate feature maps 618 is automatically controlled, and the background noise level is associated with an external portion 550A of the input image 500 that is external to the one or more bounding boxes 612. The external portion 550A includes a background of the input image 500. Referring to Figure 5A, the external portion 550A includes three fruit images that do not contain any textual content 502. As the background noise level of the one or more intermediate feature maps 618 is automatically controlled, the background noise level of the output feature map 618E that is used by the classification network 610 to classify the input image 500 is also controlled to enhance an accuracy level of determining the input text orientation 602 of the input image 500.

[0053] It is noted that the background noise level of the one or more intermediate feature maps 618 is automatically regularized as the backbone network 608 is applied to process the input image 500. The backbone network 608 has been trained to regularize the background noise level of the one or more intermediate feature maps 618. Specifically, in some embodiments, the electronic system 200 (e.g., a model training module 226 of a server 102) obtains a plurality of training images and classifies the plurality of training images using the backbone network 608 and classification network 610. The electronic system 200 adjusts the backbone network 608 and classification network 610 based on a weighted combination 620 of a binary classification loss 622 associated with the input text orientation 602 outputted by the classification network 610 and a feature loss 624 of at least one of a series of successive intermediate feature maps 618 of the backbone network 608. Further, in some embodiments, the weighted combination 620 is represented by L_total = L_cls + λ1·L_MSE1 + λ2·L_MSE2, where L_cls is the binary classification loss 622, λ1 and λ2 are predefined weights, and L_MSE1 and L_MSE2 are two mean squared error (MSE) losses 624 for two of the series of successive intermediate feature maps 618 of the backbone network 608. The electronic system 200 creates two intermediate feature ground truths 626 by filling 1 into first regions 628 of the two intermediate feature ground truths 626 and filling 0 into remaining regions 630 of the two intermediate feature ground truths 626. The first regions 628 correspond to one or more bounding boxes 612 enclosing the textual content 502 in the input image 500, and the remaining regions 630 correspond to the external portion 550A of the input image 500.
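A minimal PyTorch sketch of this training signal is given below: it builds an intermediate feature ground truth mask from the bounding boxes and combines the classification loss with the MSE feature losses. The loss weights, the reduction of each feature map to a single channel before the MSE, and the use of cross-entropy for the binary classification loss are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def make_feature_ground_truth(boxes, image_hw, feature_hw):
    """Ground truth mask: 1 inside the bounding boxes (first regions),
    0 elsewhere (remaining regions), resized to the feature-map resolution."""
    h, w = image_hw
    mask = torch.zeros(1, 1, h, w)
    for x0, y0, x1, y1 in boxes:
        mask[..., y0:y1, x0:x1] = 1.0
    return F.interpolate(mask, size=feature_hw, mode="nearest")

def combined_loss(logits, orientation_label, feats, gts, weights=(1.0, 1.0)):
    """Weighted combination of the classification loss and MSE feature losses."""
    total = F.cross_entropy(logits, orientation_label)   # classification loss
    for feat, gt, w in zip(feats, gts, weights):
        act = feat.mean(dim=1, keepdim=True)             # (N, 1, H, W) activation map
        total = total + w * F.mse_loss(act, gt)          # feature (MSE) loss
    return total
```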

[0054] Additionally or alternatively, in some embodiments, the electronic system 200 creates three or more intermediate feature ground truths 626 by filling 1 into first regions 628 of the three or more intermediate feature ground truths 626 and filling 0 into remaining regions 630 of the three or more intermediate feature ground truths 626. The electronic system 200 adjusts the backbone network 608 and classification network 610 based on the weighted combination 620 of the binary classification loss 622 and the MSE losses 624 for the three or more intermediate feature maps 618. In an example, the electronic system 200 adjusts the backbone network 608 and classification network 610 based on the weighted combination 620 of the binary classification loss 622 and the MSE losses 624 for all of the intermediate feature maps 618. In some embodiments, the series of successive intermediate feature maps 618 have resolutions, e.g., 512x512, 256x256, 128x128, 64x64, and 32x32, which are scaled successively by a predefined scaling factor.

[0055] In some embodiments, the backbone network 608 or classification network 610 is trained by a server 102 and provided to a client device 104. Alternatively, in some embodiments, the backbone network 608 or classification network 610 is trained and applied by a server 102. For training, the server 102 obtains a plurality of training images and a plurality of associated ground truths (e.g., a ground truth of the input text orientation 602), and classifies the plurality of training images using the backbone network 608 and classification network 610. The server 102 adjusts the backbone network 608 or classification network 610 to match the classification results (i.e., the input text orientation 602) with the associated ground truths, based on at least a classification loss 622.
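For concreteness, a bare-bones training loop over orientation-labeled images might look like the sketch below; it uses only the classification loss of this paragraph, assumes the backbone returns a list of feature maps with the output map last, and assumes the classifier exposes raw logits (the softmax of paragraph [0050] being folded into the cross-entropy for numerical stability).

```python
import torch
import torch.nn.functional as F

def train_one_epoch(backbone, classifier, loader, optimizer, device="cpu"):
    """One epoch of training driven by the orientation ground truths."""
    backbone.train()
    classifier.train()
    for images, orientation_labels in loader:          # ground-truth orientations
        images = images.to(device)
        orientation_labels = orientation_labels.to(device)
        feats = backbone(images)                        # list of feature maps
        logits = classifier(feats[-1])                  # classify from the output map
        loss = F.cross_entropy(logits, orientation_labels)
        optimizer.zero_grad()
        loss.backward()                                 # gradients for both networks
        optimizer.step()
```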

[0056] In some embodiments, the one or more bounding boxes 612 are identified and applied to enhance an accuracy level of determining the input text orientation 602, recognizing the textual content 502 in the input image 500, or both. In some situations, the accuracy level of determining the input text orientation 602 exceeds 99%. Additionally, the backbone network 608 that identifies the bounding boxes 612 can be trained efficiently and effectively. For example, a training time of the backbone network 608 is controlled within one hour if the backbone network 608 is launched on a mobile phone 104C. Particularly, referring to Figure 5B, the textual content image 550 corresponding to an output feature map 616E of the backbone network 608 has a clear external portion 550A, indicating that a background noise has been effectively controlled during regularization of one or more intermediate feature maps 618.

[0057] It is noted that a planar coordinate system 634 is not limited to two complementary angular ranges 636 and 638. In some embodiments not shown, the planar coordinate system 634 includes a plurality of text orientations corresponding to a plurality of angular ranges (e.g., more than 2 angular ranges). For example, the plurality of text orientations include the input text orientation 602, the target text orientation distinct from the input text orientation 602, and at least one additional text orientation. Each of the plurality of text orientations covers a non-overlapping angular range of the planar coordinate system 634. The plurality of text orientations cover an entire plane of the planar coordinate system 634. The textual content 502 is rotated by a rotation angle from any non-target text orientation to the target text orientation. In some embodiments, the rotation angle is selected from a plurality of predefined discrete angles (e.g., clockwise 90°, counter-clockwise 90°).

[0058] Figure 7 is a block diagram of an example ResNet-based backbone network 608, in accordance with some embodiments. The backbone network 608 has one or more input layers and four stages, and is configured to receive the input image 500 and generate an output feature map 618E from the input image 500. The backbone network 608 generates a plurality of feature maps 616A-616E including the output feature map 618E. The plurality of feature maps 616A-616E include a subset of intermediate feature maps 618 that are automatically regularized as the backbone network 608 is applied. This occurs because the subset of intermediate feature maps 618 are regularized with respect to a set of intermediate feature ground truths 626 to control a background noise level during training. The background noise level corresponds to an external portion 550A of the input image 500 that is external to the one or more bounding boxes 612, and the set of intermediate feature ground truths 626 are defined based on the one or more bounding boxes 612 to control the background noise level.
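One way to realize a four-stage ResNet-based backbone that also exposes its intermediate feature maps is sketched below using torchvision's resnet18; the choice of resnet18 and the exact layer grouping are illustrative assumptions, and any ResNet variant (or torchvision's feature-extraction utilities) could serve the same purpose.

```python
import torch.nn as nn
from torchvision.models import resnet18  # torchvision >= 0.13 assumed

class ResNetBackbone(nn.Module):
    """Sketch of a ResNet-based backbone exposing its stage outputs."""

    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        # Input layers (stem) followed by the four residual stages.
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.stages = nn.ModuleList([net.layer1, net.layer2, net.layer3, net.layer4])

    def forward(self, x):
        feats = []
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # successive stage outputs (intermediate feature maps)
        return feats          # feats[-1] serves as the output feature map
```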

[0059] Figure 8 is a flow diagram of an example image processing method 800 for enhancing OCR based on an image orientation, in accordance with some embodiments. For convenience, the method 800 is implemented by at least an electronic device (e.g., a data processing module 228 of a mobile phone 104C). Method 800 is, optionally, governed by instructions that are stored in a non-transitory computer readable storage medium and that are executed by one or more processors of the computer system. Each of the operations shown in Figure 8 may correspond to instructions stored in a computer memory or non-transitory computer readable storage medium (e.g., memory 206 in Figure 2). The computer readable storage medium may include a magnetic or optical disk storage device, solid state storage devices such as Flash memory, or other non-volatile memory device or devices. The instructions stored on the computer readable storage medium may include one or more of: source code, assembly language code, object code, or other instruction format that is interpreted by one or more processors. Some operations in method 800 may be combined and/or the order of some operations may be changed.

[0060] The electronic device obtains (802) an input image 500 including textual content 502 to be extracted from the input image 500 and determines (804) that the textual content 502 has an input text orientation 602 that is distinct from a target text orientation. A rotation angle is determined (806) based on the input text orientation 602 and target text orientation. The electronic device rotates (808) the input image 500 by the rotation angle to the target text orientation and recognizes (810) the textual content 502 in the rotated input image 500 having the target text orientation. In some embodiments, the electronic device recognizes the textual content 502 in the rotated input image 500 by identifying (812) one or more bounding boxes 612 enclosing the textual content 502 in the input image 500, cropping (814) the rotated input image 500 according to the one or more bounding boxes 612 to generate a textual content image 615 including the one or more bounding boxes 612 enclosing the textual content 502, and recognizing (816) the textual content 502 in the textual content image 615. In an example, the target text orientation is a horizontal orientation 638, and the input text orientation 602 is a vertical orientation 636.

[0061] Particularly, in some embodiments, the electronic device applies a backbone network 608 to identify one or more bounding boxes 612 that closely enclose one or more portions of the textual content 502 and extract a plurality of feature maps 616 from the input image 500. The plurality of feature maps 616 include one or more intermediate feature maps 618, and the backbone network 608 is configured to control a background noise level, of the one or more intermediate feature maps 618, associated with an external portion of the input image 500 that is external to the one or more bounding boxes 612. By these means, when the one or more intermediate feature maps 618 are applied to determine the input text orientation 602, the input text orientation 602 can be determined with an enhanced accuracy level.

[0062] In some embodiments, the electronic device applies (818) a backbone network 608 to extract a plurality of feature maps 616 from the input image 500 and applies (822) a classification network 610 to an output feature map in the plurality of feature maps 616 to classify the input image 500 to the input text orientation 602. Optionally, the plurality of feature maps 616 include (820) a series of successive intermediate feature maps 618. In some embodiments, the electronic device identifies one or more bounding boxes 612 that closely enclose one or more portions of the textual content 502 and controls a background noise level of an output feature map that is used by the classification network 610 to classify the input image 500. The reduced background noise level corresponds to an external portion 550A of the input image 500 that is external to the one or more bounding boxes 612. In an example, the background noise level corresponding to the external portion 550A of the input image 500 is controlled below a threshold noise level.

[0063] In some embodiments, the classification network 610 is applied for binary classification and configured to output one of the input text orientation 602 and the target text orientation, and the input text orientation 602 and the target text orientation are complementary to each other. For example, the input text orientation 602 is represented by a vector having two elements corresponding to a horizontal orientation and a vertical orientation. Additionally, in some embodiments, a server or the electronic device determines (824) a resource level of the electronic device and selects (826) one of VGG-16, GhostNet, EfficientNet, and ResNet as the backbone network 608 based on the resource level.
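A simple resource-aware selection of the backbone architecture could look like the sketch below; the mapping from resource level to architecture is an illustrative assumption, and GhostNet is omitted because it is not bundled with torchvision (it is available from third-party packages such as timm).

```python
from torchvision.models import efficientnet_b0, resnet18, vgg16  # torchvision >= 0.13

def select_backbone(resource_level: str):
    """Pick a backbone architecture based on the device's resource level."""
    if resource_level == "low":
        return resnet18(weights=None)         # lightweight residual network
    if resource_level == "medium":
        return efficientnet_b0(weights=None)  # efficiency-oriented network
    if resource_level == "high":
        return vgg16(weights=None)            # larger classical backbone
    raise ValueError(f"unknown resource level: {resource_level}")
```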

[0064] In some embodiments, a server or the electronic device trains the backbone network 608 and classification network 610. During this process, the server or electronic device obtains a plurality of training images and a plurality of associated ground truths, classifies the plurality of training images using the backbone network 608 and classification network 610, and adjusts the backbone network 608 and classification network 610 to match the classification results with the associated ground truths.

[0065] Alternatively, in some embodiments, the backbone network 608 and classification network 610 are trained by obtaining a plurality of training images, classifying the plurality of training images using the backbone network 608 and classification network 610, and adjusting the backbone network 608 and classification network 610 based on a weighted combination 620 of a classification loss 622 and a feature loss 624 of at least one of the series of successive intermediate feature maps 618 that are generated from the training images by the backbone network 608. Additionally, in some embodiments, the weighted combination 620 is represented by L_total = L_cls + λ1·L_MSE1 + λ2·L_MSE2, where L_cls is a binary classification loss 622, λ1 and λ2 are predefined weights, and L_MSE1 and L_MSE2 are two MSE losses 624 for two of the series of successive intermediate feature maps 618 of the backbone network 608. The server 102 or the electronic device identifies one or more bounding boxes 612 enclosing textual content 502 of the training images and creates two intermediate feature ground truths 626 by filling 1 into first regions 628 of the two intermediate feature ground truths 626 and filling 0 into remaining regions 630 of the two intermediate feature ground truths 626. The one or more first regions 628 correspond to the one or more bounding boxes 612 enclosing the textual content 502 in the training images.

[0066] In some embodiments, the plurality of feature maps 616 include (820) a series of successive intermediate feature maps 618, and the series of successive intermediate feature maps 618 of the backbone network 608 have resolutions that are scaled successively by a scaling factor. For example, the resolutions of the series of successive intermediate feature maps 618 are 512x512, 256x256, 128x128, 64x64, and 32x32.
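The successive scaling can be expressed directly; the short sketch below reproduces the example resolutions with an assumed scaling factor of 2.

```python
def feature_map_resolutions(base: int = 512, scale: int = 2, count: int = 5):
    """Resolutions of successive intermediate feature maps, scaled by a fixed factor."""
    return [(base // scale**i, base // scale**i) for i in range(count)]

assert feature_map_resolutions() == [(512, 512), (256, 256), (128, 128), (64, 64), (32, 32)]
```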

[0067] In some embodiments, the input text orientation 602 corresponds to a first angular range 636 of a planar coordinate system 634. The target text orientation corresponds to a second angular range 638 of the planar coordinate system 634 that is complementary to the first angular range 636. The input and target text orientations jointly cover an entire plane of the planar coordinate system 634. In accordance with a determination that a textual orientation of the textual content 502 is within the first angular range of the planar coordinate system 634, the textual content 502 is determined to have the input text orientation 602.

[0068] In some embodiments, the input text orientation 602 corresponds to a vertical text orientation in a first angular range 636 covering (-135°, -45°) and (45°, 135°) of a planar coordinate system 634. The target text orientation corresponds to a horizontal text orientation in a second angular range 638 covering (-45°, 45°), (135°, 180°), and (-180°, -135°) of the planar coordinate system 634. The input image 500 is rotated clockwise by 90°.

[0069] In some embodiments, the textual content 502 is determined to have the input text orientation 602 using a classification network, and the classification network is configured to output one of a plurality of text orientations including the input text orientation 602, the target text orientation, and at least one additional text orientation. Each of the plurality of text orientations covers a non-overlapping angular range of a planar coordinate system 634. The plurality of text orientations cover an entire plane of the planar coordinate system 634.

[0070] In some embodiments, the rotation angle is determined by selecting one of a plurality of predefined discrete angles as the rotation angle.
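A minimal sketch of selecting the rotation angle from predefined discrete angles follows; the sign convention and the choice of +90° for the non-target case are assumptions of this illustration.

```python
PREDEFINED_ANGLES = (0, 90, -90)  # degrees; positive means clockwise in this sketch

def select_rotation_angle(input_orientation: str,
                          target_orientation: str = "horizontal") -> int:
    """Return a rotation angle, in degrees, chosen from the predefined discrete angles."""
    if input_orientation == target_orientation:
        angle = 0          # already at the target text orientation
    else:
        angle = 90         # e.g., rotate clockwise by 90 degrees
    assert angle in PREDEFINED_ANGLES
    return angle
```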

[0071] It should be understood that the particular order in which the operations in Figure 8 have been described is merely exemplary and is not intended to indicate that the described order is the only order in which the operations could be performed. One of ordinary skill in the art would recognize various ways to determine an input text orientation and recognize textual content in an image based on the input text orientation. Additionally, it should be noted that details of other processes described above with respect to Figures 1-6 are also applicable in an analogous manner to method 800 described above with respect to Figure 8. For brevity, these details are not repeated here.

[0072] The terminology used in the description of the various described implementations herein is for the purpose of describing particular implementations only and is not intended to be limiting. As used in the description of the various described implementations and the appended claims, the singular forms “a”, “an” and “the” are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will also be understood that the term “and/or” as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items. It will be further understood that the terms “includes,” “including,” “comprises,” and/or “comprising,” when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Additionally, it will be understood that, although the terms “first,” “second,” etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish one element from another.

[0073] As used herein, the term “if” is, optionally, construed to mean “when” or “upon” or “in response to determining” or “in response to detecting” or “in accordance with a determination that,” depending on the context. Similarly, the phrase “if it is determined” or “if [a stated condition or event] is detected” is, optionally, construed to mean “upon determining” or “in response to determining” or “upon detecting [the stated condition or event]” or “in response to detecting [the stated condition or event]” or “in accordance with a determination that [a stated condition or event] is detected,” depending on the context.

[0074] The foregoing description, for purpose of explanation, has been described with reference to specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the claims to the precise forms disclosed. Many modifications and variations are possible in view of the above teachings. The embodiments were chosen and described in order to best explain principles of operation and practical applications, to thereby enable others skilled in the art.

[0075] Although various drawings illustrate a number of logical stages in a particular order, stages that are not order dependent may be reordered and other stages may be combined or broken out. While some reordering or other groupings are specifically mentioned, others will be obvious to those of ordinary skill in the art, so the ordering and groupings presented herein are not an exhaustive list of alternatives. Moreover, it should be recognized that the stages can be implemented in hardware, firmware, software or any combination thereof.




 