

Title:
METHODS AND SYSTEMS FOR EXTRACTING TEXT AND SYMBOLS FROM DOCUMENTS
Document Type and Number:
WIPO Patent Application WO/2023/279186
Kind Code:
A1
Abstract:
A text recognition system comprising a processor and a memory device storing instructions which, when executed, cause the processor to perform operations to at least: with a Graph Convolutional Neural Network (GCN) and a feature pyramid network (FPN), extract at least one visual feature from the input image; generate at least one of a predicted Text Region (TR) map and a Text Center Line (TCL) map; aggregate information from the maps to suppress false positive areas in the TCL map using the TR map and GCN node classification comprising weakly supervised text segment classification and segment link prediction; minimize error accumulation in the TR and TCL maps using dense overlapping text segments and the segment link prediction; approximate the shape of the text by integrating the aggregated information from the maps and the segment link prediction; and recognize text with a bidirectional long short-term memory with an attention mechanism.

Inventors:
CUI TINGCHENG (CA)
XU CHENGPEI (CA)
Application Number:
PCT/CA2021/050919
Publication Date:
January 12, 2023
Filing Date:
July 06, 2021
Assignee:
ORBISEED TECH INC (CA)
International Classes:
G06N3/08
Domestic Patent References:
WO2021068061A1 (2021-04-15)
Foreign References:
CN111507250A (2020-08-07)
CN108549893A (2018-09-18)
US20190272438A1 (2019-09-05)
CN106446899A (2017-02-22)
CN112487848A (2021-03-12)
Attorney, Agent or Firm:
SABETA, Anton C. et al. (CA)
Claims:
CLAIMS

1. A method for text detection and recognition, wherein the method is implemented by a computing device comprising a processor and a non-transitory computer-readable medium storing instructions which, when executed, cause the processor to perform operations to at least: receive an image; extract features from the image; determine on the image a text area having text; up-sample the extracted features at each prediction layer of a feature pyramid network (FPN); aggregate features from the FPN with a Graph Convolutional Neural Network (GCN); generate a plurality of text feature maps comprising the text area; generate dense, overlapping rectangular text segments; classify each generated text segment as at least one of a text segment, a non-text segment, and a text interval; train a neural network to learn a linkage relationship and node types associated with the text segments; generate at least one of a predicted Text Region (TR) map, a Text Center Line (TCL) map, a Height map, and an Angle map; during an inference stage, combine the plurality of text maps comprising the predicted TR map and the predicted TCL map, the linkage relationship, and the node types, thereby minimizing false positive text areas; approximate a contour of the connected snake-shaped text segments to obtain a location of the text; use a bidirectional long short-term memory (Bi-LSTM) with an attention mechanism to recognize the text according to the location of the text; and output the recognized text on a display screen.

2. The method of claim 1, wherein the dense overlapping text segments are generated by generating small rectangles on every pixel of the text areas in the TCL map, using geometric features from the Height map and the Angle map.

3. The method of claim 2, wherein each text segment is represented as a rectangle with a geometric feature (x, y, h, w, θ), where x, y are the coordinates of the center point, h is the height, w is the width, and θ is the rotation angle of the rectangle.

4. The method of claim 3, wherein a Width map is automatically obtained by applying a clip function to 1/3 × the Height map, maintaining the width between 2 and 6 pixels.

5. The method of claim 4, wherein a Non Maximum Suppression (NMS) algorithm with an Intersection over Union (IoU) threshold of 0.5 is applied on the TCL map to remove duplicate bounding boxes, thereby ensuring the density of the text segments and the connectivity of each text segment.

6. The method of claim 4, wherein the type of each text segment in the dense, overlapping text segments is determined.

7. The method of claim 6, wherein the text segments are categorized as at least one of a char segment, an interval segment, and a non-text segment.

8. The method of claim 7, wherein the text segments are annotated to enable the GCN's classification ability for further false positive (FP) suppression.

9. The method of claim 8, wherein the char segments, interval segments, and non-text segments are annotated for a synthetic dataset used to train a GCN layer to classify the type of the text segments.

10. The method of claim 9, wherein, for a dataset that does not have char-level annotation, a model pre-trained on the synthetic dataset is employed to predict the types of the text segments.

11. The method of claim 10, wherein the accuracy of classifying non-text segments and interval segments is verified by the TR map.

12. The method of claim 11, wherein the correct annotation results are added to the training dataset in an iterative manner.

13. The method of claim 12, wherein a link between the text segments is predicted by integrating a feature expression of each text segment and constructing the graph structure between the text segments, such that the type of each text segment is classified.

14. The method of claim 13, wherein the link predictions are employed to combine text segments that have a strong relationship, and weakly supervised node classification results are used to suppress false positive text areas that are inherited from a previous feature pyramid network (FPN) layer.

15. The method of claim 14, wherein a false positive suppression method minimizes error accumulation, wherein the false positive suppression method employs a plurality of feature maps and determines grouping inference of densely designed text segments with regard to the GCN's node classification and the link predictions.

16. The method of claim 14, wherein the false positive text areas are suppressed by leveraging the Graph Convolutional Neural Network (GCN)'s node classification ability on re-designed dense, overlapping text segment modelling without char-level annotation.

17. The method of claim 15, wherein a shape-approximation module is employed for arbitrary shaped text detection.

18. The method of claim 16, wherein a contour inference module approximates a contour of the connected snake-shaped text segments.

19. An indicia recognition system comprising a processor and a memory device storing instructions which, when executed, cause the processor to perform operations to at least: acquire a raw image from an image source; augment the raw image to generate an input image; with a Graph Convolutional Neural Network (GCN) and a feature pyramid network (FPN), extract at least one visual feature from the input image; determine on the image a text area having text; generate at least one of a predicted Text Region (TR) map, a Text Center Line (TCL) map, a Height map, and an Angle map; aggregate information from the maps to suppress false positive areas in the TCL map using the TR map and GCN node classification comprising weakly supervised text segment classification and segment link prediction; minimize error accumulation in the TR and TCL maps using dense overlapping text segments and the segment link prediction; approximate the shape of the text by integrating the aggregated information from the maps and the segment link prediction; locate a location of the text based on a contour of the connected snake-shaped text segments; and use a bidirectional long short-term memory (Bi-LSTM) with an attention mechanism to recognize the text according to the location of the text.

20. The indicia recognition system of claim 19, wherein the dense overlapping text segments are generated by generating small rectangles on every pixel of the text areas in the TCL map, using geometric features from the Height map and the Angle map.

21. The indicia recognition system of claim 20, wherein each text segment is represented as a rectangle with a geometric feature (x, y, h, w, θ), where x, y are the coordinates of the center point, h is the height, w is the width, and θ is the rotation angle of the rectangle.

22. The indicia recognition system of claim 21, wherein a Width map is automatically obtained by applying a clip function to 1/3 × the Height map, maintaining the width between 2 and 6 pixels.

23. The indicia recognition system of claim 22, wherein a Non Maximum Suppression (NMS) algorithm with an Intersection over Union (IoU) threshold of 0.5 is applied on the TCL map to remove duplicate bounding boxes, thereby ensuring the density of the text segments and the connectivity of each text segment.

24. The indicia recognition system of claim 22, wherein the type of each text segment in the dense, overlapping text segments is determined.

25. The indicia recognition system of claim 24, wherein the text segments are categorized as at least one of a char segment, an interval segment, and a non-text segment.

26. The indicia recognition system of claim 25, wherein the text segments are annotated to enable the GCN's classification ability for further false positive (FP) suppression.

27. The indicia recognition system of claim 26, wherein the char segments, interval segments, and non-text segments are annotated for a synthetic dataset used to train a GCN layer to classify the type of the text segments.

28. The indicia recognition system of claim 27, wherein, for a dataset that does not have char-level annotation, a model pre-trained on the synthetic dataset is employed to predict the types of the text segments.

29. The indicia recognition system of claim 28, wherein the accuracy of classifying non-text segments and interval segments is verified by the TR map.

30. The indicia recognition system of claim 29, wherein the correct annotation results are added to the training dataset in an iterative manner.

31. The indicia recognition system of claim 30, wherein a link between the text segments is predicted by integrating a feature expression of each text segment and constructing the graph structure between the text segments, such that the type of each text segment is classified.

32. The indicia recognition system of claim 31, wherein the link predictions are employed to combine text segments that have a strong relationship, and weakly supervised node classification results are used to suppress false positive text areas that are inherited from a previous feature pyramid network (FPN) layer.

33. The indicia recognition system of claim 32, wherein a false positive suppression method minimizes error accumulation, wherein the false positive suppression method employs a plurality of feature maps and determines grouping inference of densely designed text segments with regard to the GCN's node classification and the link predictions.

34. The indicia recognition system of claim 32, wherein the false positive text areas are suppressed by leveraging the Graph Convolutional Neural Network (GCN)'s node classification ability on re-designed dense, overlapping text segment modelling without char-level annotation.

35. The indicia recognition system of claim 33, wherein a shape-approximation module is employed for arbitrary shaped text detection.

36. The indicia recognition system of claim 34, wherein a contour inference module approximates a contour of the connected snake-shaped text segments.

37. An indicia recognition system comprising a processor and a memory device storing instructions which, when executed, cause the processor to perform operations to at least: acquire a raw image from an image source; augment the raw image to generate an input image; with a Graph Convolutional Neural Network (GCN) and a feature pyramid network (FPN), extract at least one visual feature from the input image; determine on the image a text area having text; generate at least one of a predicted Text Region (TR) map, a Text Center Line (TCL) map, a Height map, and an Angle map; apply the at least one of the predicted TR map and the TCL map to generate bounding boxes around text segments; apply the TCL map to generate dense overlapping text segments and apply a Non Maximum Suppression (NMS) algorithm to remove duplicate bounding boxes; train the GCN layer using a weakly supervised training method to predict a relationship between each of the text segments to form at least one relationship feature and classify each generated text segment as at least one of a text segment, a non-text segment, and a text interval; during an inference stage, based on the at least one visual feature and the at least one relationship feature of the text segments, combine the plurality of text maps comprising the predicted TR map and the predicted TCL map, the linkage relationship, and the node types, wherein the linkage relationship is used to combine text segments that have a close relationship; employ the weakly supervised node classification results to suppress the false positive text areas that are inherited from the previous FPN layer, such that the graph network minimizes error accumulation; approximate a contour of the connected snake-shaped text segments to obtain a text location using a text instance based on a contour inference module; use a bidirectional long short-term memory (Bi-LSTM) with an attention mechanism to recognize the text according to the text location; and output the recognized text on a display screen.

38. An indicia recognition system comprising: a data acquisition module for acquiring an image having text; a contour inference module for generating bounding boxes around text segments within an augmented image and approximating a contour of connected snake-shaped text segments to obtain a location of the text based on the contour; a Graph Convolutional Neural Network (GCN) layer using a weakly supervised training method to predict a relationship between each of the text segments to form at least one relationship feature and to classify each generated text segment as at least one of a text segment, a non-text segment, and a text interval; and a recognition module comprising a model that comprises one or more long short-term memory (LSTM) layers with an attention mechanism to recognize the text according to the text location.

39. A memory device storing instructions which, when executed by a processor, cause the processor to perform operations to at least: acquire a raw image from an image source and augment the raw image to determine on the image a text area having text; generate at least one text map from the text area; generate dense, overlapping rectangular text segments from the text area; with a neural network, use a weakly supervised learning method to generate classifications for each of the text segments comprising text segments, non-text segments, and text intervals; input the classifications for each of the text segments into the neural network as nodes to train the network to learn a linkage relationship between each of the text segments and node types; combine the at least one text map, the linkage relationship, and the classifications for each of the text segments during an inference stage; approximate a contour of the connected snake-shaped text segments to obtain a text location using a text instance based on a contour inference module; and use a bidirectional long short-term memory (Bi-LSTM) with an attention mechanism to recognize the text according to the text location.

Description:
METHODS AND SYSTEMS FOR EXTRACTING TEXT AND SYMBOLS FROM DOCUMENTS

FIELD

[0001] Aspects of this disclosure relate to optical character recognition methods; more particularly, to extracting text and symbols from documents.

BACKGROUND

[0002] Photographs, CAD drawings, laser scans, and video recordings of buildings capture substantial information and are critical for engineering, insurance, property management, real estate investment, and more. This information is used for applications such as building management, maintenance, insurance, and code compliance checks. Text and symbols in these documents may be of paramount importance, and it would be desirable to extract this information; however, obtaining some of it is typically performed by a manual process that involves human decision making, and such a process may be challenging to automate. The desired output contains information about the content of each text block and its location within the image.

SUMMARY

[0003] In one example, there is provided a method for text detection and recognition, wherein the method is implemented by a computing device comprising a processor and a non-transitory computer-readable medium storing instructions which, when executed, cause the processor to perform operations to at least: receive an image; extract features from the image; determine on the image a text area having text; up-sample the extracted features at each prediction layer of a feature pyramid network (FPN); aggregate features from the FPN with a Graph Convolutional Neural Network (GCN); generate a plurality of text feature maps comprising the text area; generate dense, overlapping rectangular text segments; classify each generated text segment as at least one of a text segment, a non-text segment, and a text interval; train a neural network to learn a linkage relationship and node types associated with the text segments; generate at least one of a predicted Text Region (TR) map, a Text Center Line (TCL) map, a Height map, and an Angle map; during an inference stage, combine the plurality of text maps comprising the predicted TR map and the predicted TCL map, the linkage relationship, and the node types, thereby minimizing false positive text areas; approximate a contour of the connected snake-shaped text segments to obtain a location of the text; use a bidirectional long short-term memory (Bi-LSTM) with an attention mechanism to recognize the text according to the location of the text; and output the recognized text on a display screen.

[0004] In another example, there is provided an indicia recognition system comprising a processor and a memory device storing instructions which, when executed, cause the processor to perform operations to at least: acquire a raw image from an image source; augment the raw image to generate an input image; with a Graph Convolutional Neural Network (GCN) and a feature pyramid network (FPN), extract at least one visual feature from the input image; determine on the image a text area having text; generate at least one of a predicted Text Region (TR) map, a Text Center Line (TCL) map, a Height map, and an Angle map; aggregate information from the maps to suppress false positive areas in the TCL map using the TR map and GCN node classification comprising weakly supervised text segment classification and segment link prediction; minimize error accumulation in the TR and TCL maps using dense overlapping text segments and the segment link prediction; approximate the shape of the text by integrating the aggregated information from the maps and the segment link prediction; locate a location of the text based on a contour of the connected snake-shaped text segments; and use a bidirectional long short-term memory (Bi-LSTM) with an attention mechanism to recognize the text according to the location of the text.

[0005] In one example, there is provided an indicia recognition system comprising a processor and a memory device storing instructions which, when executed, cause the processor to perform operations to at least: acquire a raw image from an image source; augment the raw image to generate an input image; with a Graph Convolutional Neural Network (GCN) and a feature pyramid network (FPN), extract at least one visual feature from the input image; determine on the image a text area having text; generate at least one of a predicted Text Region (TR) map, a Text Center Line (TCL) map, a Height map, and an Angle map; apply the at least one of the predicted TR map and the TCL map to generate bounding boxes around text segments; apply the TCL map to generate dense overlapping text segments and apply a Non Maximum Suppression (NMS) algorithm to remove duplicate bounding boxes; train the GCN layer using a weakly supervised training method to predict a relationship between each of the text segments to form at least one relationship feature and classify each generated text segment as at least one of a text segment, a non-text segment, and a text interval; during an inference stage, based on the at least one visual feature and the at least one relationship feature of the text segments, combine the plurality of text maps comprising the predicted TR map and the predicted TCL map, the linkage relationship, and the node types, wherein the linkage relationship is used to combine text segments that have a close relationship; employ the weakly supervised node classification results to suppress the false positive text areas that are inherited from the previous FPN layer, such that the graph network minimizes error accumulation; approximate a contour of the connected snake-shaped text segments to obtain the final results; and output the recognized text on a display screen.

[0006] In another example, there is provided an indicia recognition system comprising: a data acquisition module for acquiring an image having text; a contour inference module for generating bounding boxes around text segments within an augmented image and approximating a contour of connected snake-shaped text segments to obtain a location of the text based on the contour; a Graph Convolutional Neural Network (GCN) layer using a weakly supervised training method to predict a relationship between each of the text segments to form at least one relationship feature and to classify each generated text segment as at least one of a text segment, a non-text segment, and a text interval; and a recognition module comprising a model that comprises one or more long short-term memory (LSTM) layers with an attention mechanism to recognize the text according to the text location.

[0007] In another example, there is provided a memory device storing instructions which, when executed by a processor, cause the processor to perform operations to at least: acquire a raw image from an image source and augment the raw image to determine on the image a text area having text; generate at least one text map from the text area; generate dense, overlapping rectangular text segments from the text area; with a neural network, use a weakly supervised learning method to generate classifications for each of the text segments comprising text segments, non-text segments, and text intervals; input the classifications for each of the text segments into the neural network as nodes to train the network to learn a linkage relationship between each of the text segments and node types; combine the at least one text map, the linkage relationship, and the classifications for each of the text segments during an inference stage; approximate a contour of the connected snake-shaped text segments to obtain a text location using a text instance based on a contour inference module; and use a bidirectional long short-term memory (Bi-LSTM) with an attention mechanism to recognize the text according to the text location.

[0008] Advantageously, the methods and systems recognize all of the text or symbol content within an image, even when the text and symbols are tilted, rotated, or blocked and, in most cases, have arbitrary shapes. The method comprises the steps of resizing the images and padding all the images to fit the same shape, then undertaking a text detection process by feeding each image through a processing pipeline involving a ChilopodNet model and a Graph Convolutional Network model. Pathfinding algorithms are used to obtain the bounding boxes of text characters, the gap space between characters, and their order within a string of text characters. Next, text recognition is performed by a recognition processing pipeline, and the output contains information regarding the content of each text/symbol block and its location within the image.
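For illustration only, the resize-and-pad step can be sketched as follows; the square target size, nearest-neighbour resizing, and zero padding are editorial assumptions rather than details from the disclosure.

```python
import numpy as np

def resize_and_pad(image: np.ndarray, target: int = 1024) -> np.ndarray:
    """Scale the longer side to `target`, then zero-pad to a square so
    every image in a batch shares the same shape (assumed convention)."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize via index mapping (keeps the sketch dependency-free).
    rows = np.clip((np.arange(new_h) / scale).astype(int), 0, h - 1)
    cols = np.clip((np.arange(new_w) / scale).astype(int), 0, w - 1)
    resized = image[rows][:, cols]
    # Pad bottom/right with zeros up to the common target shape.
    padded = np.zeros((target, target) + image.shape[2:], dtype=image.dtype)
    padded[:new_h, :new_w] = resized
    return padded
```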

[0009] The methods and systems may be employed in a variety of applications. For example, in the construction industry, they may be used to quickly and accurately identify costly design errors across hundreds of documents and summarize the results at least 200 times faster than the manual process, giving construction engineers a fast way to validate designs and reduce errors. In another example, a document extraction algorithm, which uses a combination of natural language processing, computer vision, and machine learning, is capable of converting complex engineering documents into meaningful data.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] Figure 1 shows a high level architecture block diagram of a system for text detection;

[0011] Figure 2a shows an acquired raw image;

[0001] Figure 2b shows a text region map corresponding to the image of Figure 2a;

[0002] Figure 2c shows a text center line map corresponding to the image of Figure 2a;

[0003] Figure 2d shows overlapping rectangular text segments;

[0005] Figure 3a shows the raw input image of Figure 2a combined with text bounding boxes;

[0005] Figure 3b shows a grouped text image;

[0006] Figure 3c shows positional coordinates of the recognized text blocks;

[0007] Figure 4a shows recognized text blocks within an engineering drawing;

[0008] Figure 4b shows recognized symbols within an engineering drawing;

[0009] Figure 4c shows detected elements within an engineering drawing;

[0010] Figure 5 shows a flowchart with exemplary steps for recognizing text from an image;

[0011] Figure 6a shows a preprocessing workflow;

[0012] Figure 6b shows a detection workflow; and

[0013] Figure 6c shows a recognition workflow.

DESCRIPTION

[0014] Figure 1 is a schematic block diagram of an example of a physical environment for an indicia detection system 10, according to some embodiments. System 10 is configured for processing machine learning architectures, such as neural networks, received (e.g., stored in the form of datasets) from a machine learning architecture source. Data ingestion module 12 ingests, or receives, input data 14 that comprises image data having text therein.

[0015] The input data 14 is applied to a plurality of data processing modules 19 within computing environment 20, which also includes one or more processors 30 and a plurality of software and hardware components. The one or more processors 30 and the plurality of software and hardware components are configured to execute program instructions to perform the functions of text detection module 16 described herein and embodied in the one or more data processing modules 19.

[0016] The plurality of data processing modules 19 include, in addition to the data ingestion module 12: contour inference module 30, shape-approximation module 32, recognition module 34, and machine learning module 40, which is configured to apply a layer of artificial intelligence to detect and recognize text blocks within images. Shape-approximation module 32 is employed for arbitrary shaped text detection.

[0017] Machine learning module 40 comprises model training module 70, which may be software (e.g., code segments compiled into machine code), hardware, embedded firmware, or a combination of software and hardware, according to various embodiments. Model training module 70 is configured to receive one or more datasets representative of a neural network model, random forest, or other machine learning model, and to train the machine learning model using a step-size value which varies over time. Generally, the machine learning model and any other relevant data or hyperparameters are stored in memory 80, which is configured to maintain one or more datasets, including data structures storing linkages and other data. Memory 80 may include a relational database, a flat data storage, a non-relational database, among others. In some embodiments, memory 80 may store data representative of a model distribution set including one or more modified models based on a machine learning model, including rule memory 82.

[0018] Examples of machine learning models include Fully Connected Neural Networks (FCNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory (LSTM) networks, deep belief networks, random forests, or support vector machines.

[0019] An example of machine learning module 40 is one or more relatively specialized hardware elements operating in conjunction with one or more software elements to train a neural network and/or perform inference with a neural network relatively more efficiently than using relatively less specialized hardware elements. Some implementations of the relatively specialized hardware elements include one or more hardware logic circuitry elements such as transistors, resistors, inductors, capacitors, wire interconnects, combinatorial logic (e.g., NAND, NOR) gates, latches, register files, memory arrays, tags for memory arrays, content-addressable memories, flash, ROM, DRAM, SRAM, Serializer/Deserializer (SerDes), I/O drivers, and the like, such as implemented via custom logic, synthesized logic, ASICs, and/or FPGAs. Some of the relatively less specialized hardware elements include conventional CPUs and conventional GPUs. In one exemplary implementation, machine learning module 40 is enabled to process dataflow in accordance with computations performed for training of a neural network and/or inference with a neural network.

[0020] Looking at Figure 2a, there is shown an acquired raw image 100 with two PVC pipes 102, 104 bearing text inscriptions 106a1, 106b1, 106c1, and 106d1. Raw image 100 may be acquired from an image source, such as a 3D scan, a video, or a photograph. When raw image 100 is acquired from a 3D scan, it is flattened into a single-layer image; when raw image 100 is acquired from a video, it is extracted as a frame of the video. Raw image 100 may be resized, manipulated, or augmented with any number of effects to generate input data 14.

[0021] Data ingestion module 12 receives input data 14, and text detection module 16 determines text areas 106a2, 106b2, 106c2, and 106d2 containing text, as shown in Figure 2b, and generates a text region map 110 with bounding areas 106a3, 106b3, 106c3, and 106d3. Text center line map 112 is also generated and comprises areas 106a2, 106b2, 106c2, and 106d2, as shown in Figure 2c.

[0022] Looking at Figure 2d, machine learning module 40 generates bounding boxes 114a-d comprising dense, overlapping rectangular text segments 106a4n, 106b4n, 106c4n, and 106d4n surrounding each of the detected text characters within text areas 106a2, 106b2, 106c2, and 106d2, and generates classifications for each of text segments 106a4n, 106b4n, 106c4n, and 106d4n. The classifications may include text segments, non-text segments, and text intervals.

[0023] The classifications for each of text segments 106a4n, 106b4n, 106c4n, and 106d4n are input as nodes into a neural network model associated with training module 70. For example, in a weakly supervised learning method the neural network is trained to learn a linkage relationship between each of text segments 106a4n, 106b4n, 106c4n, and 106d4n and node types, and thereby suppress false positive text areas.

[0024] Machine learning module 40 combines text region map 110, text center line map 112, the linkage relationship, and the classifications for each of the text segments 106a4n, 106b4n, 106c4n, and 106d4n during an inference stage.

[0025] Looking at Figure 3a, there is shown raw input image 100 combined with text bounding boxes 114a, 114b, and 114c, in which the text segments in the image are ungrouped; Figure 3b shows a grouped text image 116 with text bounding boxes 114a, 114b, and 114c, in which the text segments are classified (grouped) as different text instances. During an inference stage, a plurality of text maps comprising the predicted Text Region (TR) map and the predicted Text Center Line (TCL) map are combined, including the linkage relationship and node types, thereby minimizing false positive text areas. A contour of the connected snake-shaped text segments is approximated by contour inference module 30 to obtain a location of the text. Contour inference module 30 addresses the complex route-finding problem that involves a large number of text segments. A bidirectional long short-term memory (Bi-LSTM) with an attention mechanism is used to recognize the text according to the text location. Figure 3c shows positional coordinates of the recognized text blocks, and recognized text characters within the recognized blocks. Figure 4a shows recognized text blocks within an engineering drawing; Figure 4b shows recognized symbols within an engineering drawing; and Figure 4c shows detected elements within an engineering drawing.
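For illustration only, a minimal PyTorch sketch of a Bi-LSTM recognizer with an attention mechanism of the kind named above; the feature dimensions, the learned-query attention decoder, and the vocabulary size are editorial assumptions, not details fixed by the disclosure.

```python
import torch
import torch.nn as nn

class AttnBiLSTM(nn.Module):
    """Bi-LSTM over per-position visual features sampled along a detected
    text contour, with one learned attention query per output character
    slot (an assumed decoder form)."""
    def __init__(self, feat_dim=256, hidden=128, vocab=97, max_len=32):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden, bidirectional=True,
                           batch_first=True)
        self.queries = nn.Parameter(torch.randn(max_len, 2 * hidden))
        self.head = nn.Linear(2 * hidden, vocab)

    def forward(self, feats):
        # feats: (batch, positions, feat_dim) from the located text region.
        enc, _ = self.rnn(feats)                          # (b, t, 2h)
        scores = torch.einsum('btd,sd->bst', enc, self.queries)
        attn = torch.softmax(scores, dim=-1)              # attend over t
        ctx = torch.einsum('bst,btd->bsd', attn, enc)     # (b, slots, 2h)
        return self.head(ctx)                             # per-slot logits
```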

[0026] Turning to Figure 5, there is shown flowchart 200 with exemplary steps for recognizing text from an image acquired from an image source. In step 202, a raw image 100 is acquired from an image source, such as a 3D scan, a video, or a photograph. When raw image 100 is acquired from a 3D scan, it is flattened into a single-layer image; when raw image 100 is acquired from a video, it is extracted as a frame of the video. Raw image 100 may be resized, manipulated, or augmented with any number of effects to generate input data 14. Figure 6a shows a pre-processing workflow associated with step 202.

[0027] Next, in step 204, a Graph Convolutional Neural Network (GCN) and a feature pyramid network (FPN) are used to determine on the image a text area having text and to extract at least one visual feature from input image 14. In one example, a VGG16+FPN is employed for predicting one or more Text Region (TR) maps and one or more Text Center Line (TCL) maps, as well as the geometric feature for each text segment, in step 206. In addition to the TR maps and the TCL maps, a Height map and an Angle map may also be generated. For example, the Width map may be automatically obtained by applying a clip function to 1/3 × the Height map, keeping the width between 2 and 6 pixels.
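For illustration only, the Width map rule just described is small enough to state directly in code; the array-valued Height map interface is an assumption.

```python
import numpy as np

def width_map(height_map: np.ndarray) -> np.ndarray:
    """Width map as described: one third of the Height map, clipped so
    every width stays between 2 and 6 pixels."""
    return np.clip(height_map / 3.0, 2.0, 6.0)
```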

[0028] In step 208, the TR map is used to generate small rectangles on the pixels of the text segments, resulting in dense overlapping segments, as shown in Figure 2d. In one example, each text segment is represented as a rectangle with a geometric feature (x, y, h, w, θ), where x, y are the coordinates of the center point, h is the height, w is the width, and θ is the rotation angle of the rectangle. In one example, a Width map is automatically obtained by applying a clip function to 1/3 × the Height map, maintaining the width between 2 and 6 pixels; the height and rotation angle are taken from the Height map and the Angle map.
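For illustration only, a sketch of generating one rotated rectangle per TCL pixel from the geometry maps; the corner expansion is standard rotated-box geometry, and the map interfaces are assumptions.

```python
import numpy as np

def dense_segments(tcl_mask, height_map, angle_map):
    """One (x, y, h, w, theta) rectangle per text pixel in the TCL map."""
    ys, xs = np.nonzero(tcl_mask)
    h = height_map[ys, xs]
    w = np.clip(h / 3.0, 2.0, 6.0)        # Width map rule from step 206
    theta = angle_map[ys, xs]
    return np.stack([xs, ys, h, w, theta], axis=1)

def corners(seg):
    """Four corner points of one (x, y, h, w, theta) segment."""
    x, y, h, w, t = seg
    dx = np.array([-w, w, w, -w]) / 2.0
    dy = np.array([-h, -h, h, h]) / 2.0
    c, s = np.cos(t), np.sin(t)
    return np.stack([x + c * dx - s * dy, y + s * dx + c * dy], axis=1)
```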

[0029] Next, in step 210, a Non Maximum Suppression (NMS) algorithm is applied on the TCL map to obtain small text segments. The extent of the overlap between any two segments is measured by Intersection over Union (IoU) against a predetermined threshold. The NMS algorithm with the IoU allows for the selection of one text segment out of many overlapping segments, such that any duplicate bounding boxes may be removed. Figure 6b shows a detection workflow associated with steps 204-210.
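By way of illustration, a standard greedy NMS over axis-aligned boxes is sketched below; the disclosure applies NMS to the TCL segments with an IoU threshold of 0.5 (claim 5), and a rotated-box IoU would replace the axis-aligned overlap used here for simplicity.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS; boxes are (x1, y1, x2, y2) rows, one score per box."""
    order = np.argsort(scores)[::-1]          # best-scoring box first
    keep = []
    while order.size:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the kept box with all remaining boxes.
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = ((boxes[rest, 2] - boxes[rest, 0]) *
                  (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + area_r - inter + 1e-9)
        order = rest[iou <= iou_thresh]       # drop duplicates, keep the rest
    return keep
```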

[0030] In step 212, the type of each text segment of the dense, overlapping text segments is determined and categorized. As an example, the categories may include char segments, interval segments, and non-text segments.

[0031] In step 214, the type annotation of the text segments is determined, such that the GCN layer is able to classify the text segments without bounding boxes using weakly supervised training methods. These annotations may be obtained from a synthetic dataset with character-level annotation, as character maps and their interval spaces can be used for annotating char segments and interval segments.
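For illustration, one plausible way to derive such annotations from a char-level synthetic dataset; the (x1, y1, x2, y2) box format and the assumption that character boxes arrive sorted left to right are editorial, not taken from the disclosure.

```python
def label_segment(seg_center, char_boxes):
    """Label one text segment from char-level ground truth: 'char' if its
    center lies inside a character box, 'interval' if it lies in the gap
    between two adjacent characters, else 'non-text'.
    char_boxes: (x1, y1, x2, y2) tuples, assumed sorted left to right."""
    cx, cy = seg_center
    for x1, y1, x2, y2 in char_boxes:
        if x1 <= cx <= x2 and y1 <= cy <= y2:
            return "char"
    for a, b in zip(char_boxes, char_boxes[1:]):
        # Horizontal gap between consecutive characters, same vertical band.
        if a[2] <= cx <= b[0] and min(a[1], b[1]) <= cy <= max(a[3], b[3]):
            return "interval"
    return "non-text"
```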

[0032] After the char segments, interval segments, and non-text segments are annotated for the synthetic dataset, the GCN layer is trained to classify the type of the text segments, step 216. In step 218, a determination is made as to whether the dataset has char-level annotation. When the dataset does not have char-level annotation, a model pretrained on the synthetic dataset is employed to predict the types of the text segments, step 220. The prediction takes into account the visual feature and the geometric feature of each text segment; the feature expression of each text segment is integrated for the construction of the graph structure between them, which may be based on the distance between two text segments, pivot points, and predetermined nodes on the text segments.
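The disclosure names a GCN layer but does not fix its form; the sketch below uses the standard normalized-adjacency propagation as an assumed stand-in, ending in the three-way char / interval / non-text head.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: average over neighbours (with self-loops),
    then a linear map. The exact layer form is an assumption."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):
        # x: (nodes, in_dim); adj: (nodes, nodes) binary adjacency.
        a = adj + torch.eye(adj.size(0), device=adj.device)
        deg = a.sum(dim=1, keepdim=True)
        return torch.relu(self.lin(a @ x / deg))

class SegmentClassifier(nn.Module):
    """Two GCN layers over segment nodes, ending in a 3-way type head."""
    def __init__(self, feat_dim=64, hidden=32):
        super().__init__()
        self.g1 = GCNLayer(feat_dim, hidden)
        self.g2 = GCNLayer(hidden, hidden)
        self.head = nn.Linear(hidden, 3)   # char / interval / non-text

    def forward(self, x, adj):
        return self.head(self.g2(self.g1(x, adj), adj))
```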

[0033] Next, the accuracy of classifying non-text segments and interval segments is considered, step 222. If no non-text segments are wrongly labelled, the annotation result is considered correct; otherwise the annotation result is not added to the training set and the weakly supervised training continues (step 216). A loss function guides the training of the GCN, the FPN, and the dense overlapping text segments, such that if the loss is large, the text detection model will be inaccurate. The extent to which the labelling is correct may be verified via the TR map. In step 224, the instances in which the annotation results are correct are added to the training set in an iterative manner. Next, feature aggregation is performed by the GCN, link predictions are employed to combine text segments that have a strong relationship, and weakly supervised node classification results are used to suppress false positive text areas that are inherited from a previous feature pyramid network (FPN) layer, such that the graph network minimizes error accumulation. Figure 6c shows a recognition workflow associated with steps 212-224.
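A sketch of the TR-map check in steps 222-224 under stated assumptions: the 0.5 threshold on the TR map, the (x, y, ...) segment layout, and the batch interface are editorial choices, not values from the disclosure.

```python
def verify_with_tr(segments, labels, tr_map):
    """Accept a weak annotation only if no segment labelled 'non-text'
    sits on a text-region pixel (assumed 0.5 TR threshold)."""
    for (x, y, *_), lab in zip(segments, labels):
        if lab == "non-text" and tr_map[int(y), int(x)] > 0.5:
            return False            # wrongly labelled; keep training instead
    return True

def grow_training_set(weak_batches, tr_map, train_set):
    """Iteratively add only TR-verified annotation results to the training
    set; `weak_batches` yields (segments, labels) pairs predicted by the
    model pretrained on the synthetic dataset (an assumed interface)."""
    for segments, labels in weak_batches:
        if verify_with_tr(segments, labels, tr_map):
            train_set.append((segments, labels))
    return train_set
```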

[0034] Looking at Figure 1, computing environment 20 may be coupled to a network 90, having other computing devices 92, 94 coupled thereto.

[0035] The described illustration is only one possible implementation of the described subject matter and is not intended to limit the disclosure to the single described implementation. Those of ordinary skill in the art will appreciate the fact that the described components can be connected, combined, and/or used in alternative ways, consistent with this disclosure.

[0036] The network 90 facilitates communication between the computing device 20 and other components, for example, components that obtain observed data for a location and transmit the observed data to the computing device 20. The network 90 can be a wireless or a wireline network. The network 90 can also be a memory pipe, a hardware connection, or any internal or external communication paths between the components.

[0037] The computing device 20 includes a computing system configured to perform the method as described herein. In some cases, the algorithm of the method can be implemented in an executable computing code, e.g., C/C++ executable codes. In some cases, the computing device 20 can include mobile or personal computers that have sufficient memory size to process data.

[0038] The computing device 20 may comprise a computer that includes an input device, such as a keypad, keyboard, touch screen, microphone, speech recognition device, other devices that can accept user information, and/or an output device that conveys information associated with the operation of the computing device 20, including digital data, visual and/or audio information, or a GUI.

[0039] The computing device 20 can serve as a client, network component, a server, a database, or other persistency, and/or any other component of the system 10. In some implementations, one or more components of the computing device 20 may be configured to operate within a cloud-computing-based environment.

[0040] At a high level, the computing device 20 is an electronic computing device operable to receive, transmit, process, store, or manage data and information associated with the system 10. According to some implementations, the computing device 20 may also include, or be communicably coupled with, an application server, e-mail server, web server, caching server, streaming data server, business intelligence (BI) server, and/or other server.

[0041] The computing device 20 can receive requests over network 90 from a client application (e.g., executing on another computing device 20) and respond to the received requests by processing said requests in an appropriate software application. In addition, requests may also be sent to the computing device 20 from internal users (e.g., from a command console or by another appropriate access method), external or third parties, other automated applications, as well as any other appropriate entities, individuals, systems, or computers.

[0042] Computing device 20 includes an interface used according to particular needs, desires, or particular implementations of the computing device 20 and/or system 10. The interface is used by computing device 20 for communicating with other systems in a distributed environment, including within the system 10, connected to the network 90 (whether illustrated or not). Generally, the interface comprises logic encoded in software and/or hardware in a suitable combination and operable to communicate with the network 90. More specifically, the interface may comprise software supporting one or more communication protocols associated with communications such that the network 90 or interface's hardware is operable to communicate physical signals within and outside of the illustrated system 10.

[0043] Although single processor 30 is illustrated in Figure 1, two or more processors may be used according to particular needs, desires, or particular implementations of the computing device 20 and/or the system 10. Generally, processor 30 executes instructions and manipulates data to perform the operations of the computing device 20.

[0044] Memory 80 that holds data for computing device 20 and/or other components of the system 10. Although illustrated as a single memory 80 in Figure 1, two or more memories may be used according to particular needs, desires, or particular implementations of the computing device 20 and/or the system 10. While memory 80 is illustrated as an integral component of the computing device 20, in alternative implementations, memory 80 can be external to the computing device 20 and/or the system 10.

[0045] An application in memory 80 comprises an algorithmic software engine providing functionality according to particular needs, desires, or particular implementations of the computing device 20 and/or the system 10, particularly with respect to functionality required for processing data. In addition, although illustrated as integral to the computing device 20, in alternative implementations, the application can be external to the computing device 20 and/or the system 10.

[0046] There may be any number of computers associated with, or external to, the system 10 and communicating over network 90. Further, the terms “client,” “user,” and other appropriate terminology may be used interchangeably, as appropriate, without departing from the scope of this disclosure. Moreover, this disclosure contemplates that many users may use one computing device 20, or that one user may use multiple computing device 20.

[0047] Implementations of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Implementations of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible, non-transitory computer-storage medium for execution by, or to control the operation of, data processing apparatus. Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer-storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

[0048] The terms “data processing apparatus,” “computer,” or “electronic computer device” (or equivalent as understood by one of ordinary skill in the art) refer to data processing hardware and encompass all kinds of apparatus, devices, and machines for processing data, including by way of example, a programmable processor, a computer, or multiple processors or computers. The apparatus can also be, or further include, special purpose logic circuitry, e.g., a central processing unit (CPU), an FPGA (field programmable gate array), or an ASIC (application-specific integrated circuit). In some implementations, the data processing apparatus and/or special purpose logic circuitry may be hardware-based and/or software-based. The apparatus can optionally include code that creates an execution environment for computer programs, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them. The present disclosure contemplates the use of data processing apparatuses with or without conventional operating systems, for example LINUX, UNIX, WINDOWS, MAC OS, ANDROID, IOS, or any other suitable conventional operating system.

[0049] A computer program, which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup language document, in a single file dedicated to the program in question, or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network. While portions of the programs illustrated in the various figures are shown as individual modules that implement the various features and functionality through various objects, methods, or other processes, the programs may instead include a number of sub-modules, third-party services, components, libraries, and such, as appropriate. Conversely, the features and functionality of various components can be combined into single components, as appropriate.

[0050] The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., a CPU, an FPGA, or an ASIC.

[0051] Computers suitable for the execution of a computer program can be based on general or special purpose microprocessors, both, or any other kind of CPU. Generally, a CPU will receive instructions and data from a read-only memory (ROM) or a random access memory (RAM) or both. The essential elements of a computer are a CPU for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to, receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone or a personal digital assistant (PDA).

[0052] Computer-readable media (transitory or non-transitory, as appropriate) suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices, e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM, DVD+/-R, DVD-RAM, and DVD-ROM disks. The memory may store various objects or data, including caches, classes, frameworks, applications, backup data, jobs, web pages, web page templates, database tables, repositories storing business and/or dynamic information, and any other appropriate information including any parameters, variables, algorithms, instructions, rules, constraints, or references thereto. Additionally, the memory may include any other appropriate data, such as logs, policies, security or access data, reporting files, as well as others. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

[0053] To provide for interaction with a user, implementations of the subject matter described in this specification can be implemented on a computer having a display device, e.g., LCD (liquid crystal display), LED (Light Emitting Diode), or plasma monitor, for displaying information to the user and a keyboard and a pointing device, e.g., a mouse, trackball, or trackpad by which the user can provide input to the computer. Input may also be provided to the computer using a touchscreen, such as a tablet computer surface with pressure sensitivity, a multi-touch screen using capacitive or electric sensing, or other type of touchscreen. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

[0054] The term “graphical user interface,” or “GUI,” may be used in the singular or the plural to describe one or more graphical user interfaces and each of the displays of a particular graphical user interface. Therefore, a GUI may represent any graphical user interface, including but not limited to, a web browser, a touch screen, or a command line interface (CLI) that processes information and efficiently presents the information results to the user. In general, a GUI may include a plurality of user interface (UI) elements, some or all associated with a web browser, such as interactive fields, pull-down lists, and buttons operable by the business suite user. These and other UI elements may be related to or represent the functions of the web browser.

[0055] Implementations of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a Web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of wireline and/or wireless digital data communication, e.g., a communication network. Examples of communication networks include a local area network (LAN), a radio access network (RAN), a metropolitan area network (MAN), a wide area network (WAN), Worldwide Interoperability for Microwave Access (WIMAX), a wireless local area network (WLAN) using, for example, 802.11 a/b/g/n and/or 802.20, all or a portion of the Internet, and/or any other communication system or systems at one or more locations. The network may communicate with, for example, Internet Protocol (IP) packets, Frame Relay frames, Asynchronous Transfer Mode (ATM) cells, voice, video, data, and/or other suitable information between network addresses.

[0056] The computing system can include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

[0057] While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or on the scope of what may be claimed, but rather as descriptions of features that may be specific to particular implementations of particular inventions. Certain features that are described in this specification in the context of separate implementations can also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation can also be implemented in multiple implementations separately or in any suitable sub-combination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a sub-combination or variation of a sub-combination.

[0058] Particular implementations of the subject matter have been described. Other implementations, alterations, and permutations of the described implementations are within the scope of the following claims, as will be apparent to those skilled in the art. While operations are depicted in the drawings or claims in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed (some operations may be considered optional) to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous.

[0059] Moreover, the separation and/or integration of various system modules and components in the implementations described above should not be understood as requiring such separation and/or integration in all implementations, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

[0060] Accordingly, the above description of example implementations does not define or constrain this disclosure. Other changes, substitutions, and alterations are also possible without departing from the spirit and scope of this disclosure.

[0061] The benefits and advantages described above may relate to one embodiment or may relate to several embodiments. The embodiments are not limited to those that solve any or all of the stated problems or those that have any or all of the stated benefits and advantages. The operations of the methods described herein may be carried out in any suitable order, or simultaneously where appropriate. Additionally, individual blocks may be added or deleted from any of the methods without departing from the spirit and scope of the subject matter described herein. Aspects of any of the examples described above may be combined with aspects of any of the other examples described to form further examples without losing the effect sought.

[0062] Benefits, other advantages, and solutions to problems have been described above with regard to specific embodiments. However, the benefits, advantages, solutions to problems, and any element(s) that may cause any benefit, advantage, or solution to occur or become more pronounced are not to be construed as critical, required, or essential features or elements of any or all the claims. As used herein, the terms "comprises," "comprising," or any other variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Further, no element described herein is required for the practice of the invention unless expressly described as "essential" or "critical."

[0063] The preceding detailed description of exemplary embodiments of the invention makes reference to the accompanying drawings, which show the exemplary embodiments by way of illustration. While these exemplary embodiments are described in sufficient detail to enable those skilled in the art to practice the invention, it should be understood that other embodiments may be realized and that logical and mechanical changes may be made without departing from the spirit and scope of the invention. For example, the steps recited in any of the method or process claims may be executed in any order and are not limited to the order presented. Thus, the preceding detailed description is presented for purposes of illustration only and not of limitation, and the scope of the invention is defined by the preceding description and the attached claims.