

Title:
TRANSFORMER-BASED SCENE TEXT DETECTION
Document Type and Number:
WIPO Patent Application WO/2022/099325
Kind Code:
A1
Abstract:
A system may include a backbone network configured to generate feature maps from an image, a transformer network coupled to the backbone network, and a scene text detection subsystem, the scene text detection subsystem comprising a processor, and a non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to generate a plurality of image tokens from one or more feature maps of an input image, and generate, via a transformer encoder of the transformer network, a set of token queries, wherein the set of token queries quantify an attention of a respective textual feature of a respective image token of the plurality of image tokens relative to all other respective textual features of all other image tokens, and generate, via a transformer decoder of the transformer network, a set of predicted text boxes of the input image.

Inventors:
LI JIACHEN (US)
ZHANG KAIYU (US)
LIU RONGRONG (US)
LIN YUAN (US)
Application Number:
PCT/US2022/011790
Publication Date:
May 12, 2022
Filing Date:
January 10, 2022
Assignee:
INNOPEAK TECH INC (US)
International Classes:
G06K9/62; G06K9/00; G06T7/00
Foreign References:
US20210271705A12021-09-02
US20210110189A12021-04-15
US20180005082A12018-01-04
US20210150249A12021-05-20
Attorney, Agent or Firm:
BRATSCHUN, Thomas D. et al. (US)
Claims:
WHAT IS CLAIMED IS:

1. A method comprising: generating, via a computer system, a plurality of image tokens from one or more feature maps, wherein the plurality of image tokens is concatenated into a sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of an input image; generating, via a transformer encoder, a set of token queries, wherein the set of token queries are an encoder output of the transformer encoder comprising the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens; and generating, via a transformer decoder, a set of predicted text boxes of the input image based on the set of token queries, wherein the set of predicted text boxes of the input image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image.

2. The method of claim 1, wherein generating the encoder output further comprises: generating, via a first encoder layer of the transformer encoder, a plurality of query metrics, a plurality of key metrics, and a plurality of value metrics from the plurality of image tokens, wherein the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics are vectors generated by respective linear transforms; and applying, via the first encoder layer, self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics.

3. The method of claim 2, wherein generating the encoder output further comprises:

applying, via one or more subsequent encoder layers, self-attention to a preceding output of a preceding encoder layer, wherein applying self-attention to the outputs of the preceding encoder layer comprises: generating, via a subsequent encoder layer of the one or more subsequent encoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding encoder layer; and applying, via the subsequent encoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors.

4. The method of claim 2, wherein applying, via the first encoder layer, self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics further comprises: dividing, at the first encoder layer, the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics respectively into an M-number of pluralities of query metric sub-vectors, an M-number of pluralities of key metric sub-vectors, and an M-number of pluralities of value metric sub-vectors, wherein M is an integer corresponding to a number of attention heads in a first multi-headed attention block of the first encoder layer; and applying, via a first attention head of the first multi-headed attention block of the first encoder layer, self-attention to a first plurality of query metric sub-vectors, a first plurality of key metric sub-vectors, and a first plurality of value metric sub-vectors.

5. The method of claim 2, wherein applying, via the first encoder layer, self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics further comprises: assigning, based on a deformable filter, a respective set of key metrics of the plurality of key metrics comprising a fixed number of key metrics to each query metric of the plurality of query metrics; and wherein applying self-attention further comprises producing a score matrix between each query metric and its respective set of key metrics.


6. The method of claim 1, wherein generating the decoder output further comprises: generating, via a first decoder layer of the transformer decoder, a plurality of query queries, a plurality of key queries from the set of token queries, and a plurality of value queries from a set of text queries, wherein the plurality of query queries, the plurality of key queries, and the plurality of value queries are vectors generated by respective linear transforms; and applying, via the first decoder layer, self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries.

7. The method of claim 6, wherein applying, via the first decoder layer, self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries further comprises: dividing, at the first decoder layer, the plurality of query queries, the plurality of key queries, and the plurality of value queries respectively into an M-number of pluralities of query query sub-vectors, an M-number of pluralities of key query sub-vectors, and an M-number of pluralities of value query sub-vectors, wherein M is an integer corresponding to a number of attention heads in a first multi-headed attention block of the first decoder layer; and applying, via a first attention head of the first multi-headed attention block of the first decoder layer, self-attention to a first plurality of query query sub-vectors, a first plurality of key query sub-vectors, and a first plurality of value query sub-vectors.

8. The method of claim 6, wherein applying, via the first decoder layer, self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries further comprises: determining, based on a deformable filter, a set of value queries of the plurality of value queries comprising a fixed number of value queries, wherein each value query of the set of value queries is selected to correspond to locations of the one or more feature maps exhibiting prominent features over other pixels; and wherein applying self-attention further comprises multiplying the attention weights, generated based on the plurality of query queries and the plurality of key queries, with the set of value queries.

9. The method of claim 1, further comprising: enforcing, via the computer system, unique matching between each of one or more ground-truth text boxes of the input image and the set of predicted text boxes of the input image, based on a cost-based matching algorithm, wherein each ground-truth text box is uniquely matched to one corresponding predicted text box of the set of predicted text boxes.

10. An apparatus, comprising: a processor; and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: generate a plurality of image tokens from one or more feature maps, wherein the plurality of image tokens is concatenated into a sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of an input image; obtain, via a transformer encoder, the plurality of image tokens as an encoder input vector to a first encoder layer; generate, via the transformer encoder, a set of token queries, wherein the set of token queries are an encoder output of the transformer encoder comprising the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens; obtain, via a transformer decoder, the set of token queries and a set of text queries as a decoder input vector to a first decoder layer, wherein the set of text queries is a set of one or more vectors representing coordinates of the input image defining one or more respective text boxes; and generate, via the transformer decoder, a set of predicted text boxes of the input image based on the set of token queries, wherein the set of predicted text boxes of the input image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image.

11. The apparatus of claim 10, wherein generating the encoder output further comprises: generating, via a first encoder layer of the transformer encoder, a plurality of query metrics, a plurality of key metrics, and a plurality of value metrics from the plurality of image tokens, wherein the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics are vectors generated by respective linear transforms; and applying, via the first encoder layer, self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics; wherein generating the decoder output further comprises: generating, via a first decoder layer of the transformer decoder, a plurality of query queries, a plurality of key queries from the set of token queries, and a plurality of value queries from the set of text queries, wherein the plurality of query queries, the plurality of key queries, and the plurality of value queries are vectors generated by respective linear transforms; and applying, via the first decoder layer, self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries.

12. The apparatus of claim 11, wherein the set of instructions is further executable by the processor to:

apply, via one or more subsequent encoder layers, self-attention to a preceding output of a preceding encoder layer, wherein applying self-attention to the outputs of the preceding encoder layer comprises: generating, via a subsequent encoder layer of the one or more subsequent encoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding encoder layer; applying, via the subsequent encoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors generated from the preceding output of the preceding encoder layer; and apply, via one or more subsequent decoder layers, self-attention to a preceding output of a preceding decoder layer, wherein applying self-attention to the outputs of the preceding decoder layer comprises: generating, via a subsequent decoder layer of the one or more subsequent decoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding decoder layer; and applying, via the subsequent decoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors generated from the preceding output of the preceding decoder layer.

13. The apparatus of claim 10, wherein the set of instructions is further executable by the processor to: assign, based on a deformable filter, a respective set of key metrics of the plurality of key metrics comprising a fixed number of key metrics to each query metric of the plurality of query metrics; and wherein applying self-attention further comprises producing a score matrix between each query metric and its respective set of key metrics.


14. The apparatus of claim 10, wherein the set of instructions is further executable by the processor to: determine, based on a deformable filter, a set of value queries of the plurality of value queries comprising a fixed number of value queries, wherein each value query of the set of value queries is selected to correspond to locations of the one or more feature maps exhibiting prominent features over other pixels; and wherein applying self-attention further comprises multiplying the attention weights, generated based on the plurality of query queries and the plurality of key queries, with the set of value queries.

15. The apparatus of claim 10, wherein the set of instructions is further executable by the processor to: enforce unique matching between each of one or more ground-truth text boxes of the input image and the set of predicted text boxes of the input image, based on a cost-based matching algorithm, wherein each ground-truth text box is uniquely matched to one corresponding predicted text box of the set of predicted text boxes.

16. A system for transformer-based scene text detection, the system comprising: a transformer network comprising a transformer encoder and a transformer decoder, the transformer encoder comprising one or more encoder layers, and the transformer decoder comprising one or more decoder layers, each of the one or more encoder layers and one or more decoder layers comprising a respective multi-head attention block; a scene text detection subsystem comprising: a processor; and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to: generate a plurality of image tokens from one or more feature maps of an input image, wherein the plurality of image tokens is concatenated into a

sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of the input image; generate, via each of the one or more encoder layers, a plurality of query metrics, a plurality of key metrics, and a plurality of value metrics from the plurality of image tokens, wherein the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics are vectors generated by respective linear transforms; generate, via each of the one or more encoder layers, a set of token queries, wherein the set of token queries are an output of each encoder layer of the one or more encoder layers, wherein the set of token queries are the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens, wherein generating the set of token queries comprises applying self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics; generate, via each of the one or more decoder layers, a plurality of query queries, a plurality of key queries from the set of token queries, and a plurality of value queries from a set of text queries, wherein the plurality of query queries, the plurality of key queries, and the plurality of value queries are vectors generated by respective linear transforms; and detect the location of text in the input image, wherein detecting the location of the text in the input image comprises generating, via each of the one or more decoder layers, a set of predicted text boxes of the input image based on the set of token queries and the set of text queries, wherein the set of text queries is a set of one or more vectors representing coordinates of the input image defining one or more respective text boxes, wherein the set of predicted text boxes of an image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more

predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image, wherein generating the set of predicted text boxes comprises applying self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries.

17. The system of claim 16, wherein applying, via each of the one or more encoder layers, self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics further comprises: dividing, at each of the one or more encoder layers, the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics respectively into an M-number of pluralities of query metric sub-vectors, an M-number of pluralities of key metric sub-vectors, and an M-number of pluralities of value metric sub-vectors, wherein M is an integer corresponding to a number of attention heads in a first multi-headed attention block of the first encoder layer; applying, via a first attention head of a first multi-headed attention block of each respective encoder layer of the one or more encoder layers, self-attention to a first plurality of query metric sub-vectors, a first plurality of key metric sub-vectors, and a first plurality of value metric sub-vectors; wherein applying, via each of the one or more decoder layers, self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries further comprises: dividing, at each decoder layer of the one or more decoder layers, the plurality of query queries, the plurality of key queries, and the plurality of value queries respectively into an M-number of pluralities of query query sub-vectors, an M-number of pluralities of key query sub-vectors, and an M-number of pluralities of value query sub-vectors, wherein M corresponds to the number of attention heads in the first multi-headed attention block of the first decoder layer; and applying, via a first attention head of a first multi-headed attention block of each respective decoder layer of the one or more decoder layers, self-attention to a first plurality of query query sub-vectors, a first plurality of key query sub-vectors, and a first plurality of value query sub-vectors.


18. The system of claim 16, wherein the set of instructions is further executable by the processor to: apply, via one or more subsequent encoder layers, self-attention to a preceding output of a preceding encoder layer, wherein applying self-attention to the outputs of the preceding encoder layer comprises: generating, via a subsequent encoder layer of the one or more subsequent encoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding encoder layer; applying, via the subsequent encoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors generated from the preceding output of the preceding encoder layer; and apply, via one or more subsequent decoder layers, self-attention to a preceding output of a preceding decoder layer, wherein applying self-attention to the outputs of the preceding decoder layer comprises: generating, via a subsequent decoder layer of the one or more subsequent decoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding decoder layer; and applying, via the subsequent decoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors generated from the preceding output of the preceding decoder layer.

19. The system of claim 16, wherein the set of instructions is further executable by the processor to: assign, based on a deformable encoder filter, a respective set of key metrics of the plurality of key metrics comprising a fixed number of key metrics to each query metric of the plurality of query metrics, wherein applying self-attention further comprises producing a score matrix between each query metric and its respective set of key metrics; and determine, based on a deformable decoder filter, a set of value queries of the plurality of value queries comprising a fixed number of value queries, wherein each value

query of the set of value queries is selected to correspond to locations of the one or more feature maps exhibiting prominent features over other pixels, wherein applying self-attention further comprises multiplying the attention weights, generated based on the plurality of query queries and the plurality of key queries, with the set of value queries.

20. The system of claim 16, wherein the set of instructions is further executable by the processor to: enforce unique matching between each of one or more ground-truth text boxes of the input image and the set of predicted text boxes of the input image, based on a cost-based matching algorithm, wherein each ground-truth text box is uniquely matched to one corresponding predicted text box of the set of predicted text boxes.


Description:
TRANSFORMER-BASED SCENE TEXT DETECTION

COPYRIGHT STATEMENT

[0001] A portion of the disclosure of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.

FIELD

[0002] The present disclosure relates, in general, to methods, systems, and apparatuses for computer vision and scene text detection using computer vision.

BACKGROUND

[0003] Scene text detection is a type of text detection technique, which aims to detect instances of text in different scenes of an image. This is a challenging task, as instances of scene text can appear in different languages, colors, fonts, sizes, orientations, and shapes. Current solutions rely on convolutional neural network (CNN) based models, which use large, consecutive convolution layers to predict the locations of text instances. CNN-based models require many inductive biases and do not generalize efficiently to different datasets. Often, training of CNN models requires anchor boxes to be manually designed for accuracy and efficiency. Moreover, the statistics of text instances and training schedules need to be adapted to each different dataset.

[0004] Thus, methods, systems, and apparatuses for transformer-based scene text detection are provided.

SUMMARY

[0005] Novel tools and techniques for transformer-based scene text detection are provided.

[0006] A method may include generating, via a computer system, a plurality of image tokens from one or more feature maps, wherein the plurality of image tokens is concatenated into a sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of an input image. The method includes generating, via a transformer encoder, a set of token queries, wherein the set of token queries are an encoder output of the transformer encoder comprising the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens. The method further includes generating, via a transformer decoder, a set of predicted text boxes of the input image based on the set of token queries, wherein the set of predicted text boxes of the input image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image.

[0007] An apparatus may include a processor and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executed by the processor to generate a plurality of image tokens from one or more feature maps, wherein the plurality of image tokens is concatenated into a sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of an input image, and obtain, via a transformer encoder, the plurality of image tokens as an encoder input vector to a first encoder layer. The set of instructions may further be executed by the processor to generate, via the transformer encoder, a set of token queries, wherein the set of token queries are an encoder output of the transformer encoder comprising the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens. The set of instructions may further be executed by the processor to obtain, via a transformer decoder, the set of token queries and a set of text queries as a decoder input vector to a first decoder layer, wherein the set of text queries is a set of one or more vectors representing coordinates of the input image defining one or more respective text boxes, and generate, via the transformer decoder, a set of predicted text boxes of the input image based on the set of token queries, wherein the set of predicted text boxes of the input image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image.

[0008] A system may include a transformer network comprising a transformer encoder and a transformer decoder, the transformer encoder comprising one or more encoder layers, and the transformer decoder comprising one or more decoder layers, each of the one or more encoder layers and one or more decoder layers comprising a respective multi-head attention block, and a scene text detection subsystem. The scene text detection subsystem may further include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to generate a plurality of image tokens from one or more feature maps of an input image, wherein the plurality of image tokens is concatenated into a sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of the input image, and to generate, via each of the one or more encoder layers, a plurality of query metrics, a plurality of key metrics, and a plurality of value metrics from the plurality of image tokens, wherein the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics are vectors generated by respective linear transforms. The instructions may be further executable by the processor to generate, via each of the one or more encoder layers, a set of token queries, wherein the set of token queries are an output of each encoder layer of the one or more encoder layers, wherein the set of token queries are the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens, wherein generating the set of token queries comprises applying self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics. 
The instructions may be further executable by the processor to generate, via each of the one or more decoder layers, a plurality of query queries, a plurality of key queries from the set of token queries, and a plurality of value queries from a set of text queries, wherein the plurality of query queries, the plurality of key queries, and the plurality of value queries are vectors generated by respective linear transforms, and to detect the location of text in the input image, wherein detecting the location of the text in the input image comprises generating, via each of the one or more decoder layers, a set of predicted text boxes of the input image based on the set of token queries and the set of text queries, wherein the set of text queries is a set of one or more vectors representing coordinates of the input image defining one or more respective text boxes, wherein the set of predicted text boxes of the input image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image, wherein generating the set of predicted text boxes comprises applying self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries.

[0009] These illustrative embodiments are mentioned not to limit or define the disclosure, but to provide examples to aid understanding thereof. Additional embodiments are discussed in the Detailed Description, and further description is provided therein.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] A further understanding of the nature and advantages of particular embodiments may be realized by reference to the remaining portions of the specification and the drawings, in which like reference numerals are used to refer to similar components. In some instances, a sub-label is associated with a reference numeral to denote one of multiple similar components. When reference is made to a reference numeral without specification to an existing sub-label, it is intended to refer to all such multiple similar components.

[0011] Fig. 1 is a schematic block diagram of a system for transformer-based scene text detection, in accordance with various embodiments;

[0012] Fig. 2 is a schematic block diagram of a transformer model architecture for scene text detection, in accordance with various embodiments;

[0013] Fig. 3 is a sequence diagram of a system for transformer-based scene text detection, in accordance with various embodiments;

[0014] Fig. 4 is a flow diagram of a method for transformer-based scene text detection, in accordance with various embodiments;

[0015] Fig. 5 is a schematic block diagram of a computer system for providing transformer-based scene text detection, in accordance with various embodiments.

DETAILED DESCRIPTION OF EMBODIMENTS

[0016] Various embodiments provide tools and techniques for transformer-based scene text detection.

[0017] In some examples, a method for transformer-based scene text detection is provided. A method may include generating, via a computer system, a plurality of image tokens from one or more feature maps, wherein the plurality of image tokens is concatenated into a sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of an input image. The method includes generating, via a transformer encoder, a set of token queries, wherein the set of token queries are an encoder output of the transformer encoder comprising the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens. The method further includes generating, via a transformer decoder, a set of predicted text boxes of the input image based on the set of token queries, wherein the set of predicted text boxes of the input image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image.
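
By way of illustration only, the following PyTorch-style sketch shows one possible arrangement of the feature maps, image tokens, transformer encoder, text queries, and predicted text boxes described above. The module names, dimensions, box head, and use of the standard `nn.TransformerEncoder`/`nn.TransformerDecoder` modules are assumptions for this sketch, not elements taken from the disclosure; the backbone is assumed to return a feature map of shape (B, backbone.out_channels, H, W).

```python
import torch.nn as nn


class SceneTextTransformer(nn.Module):
    """Illustrative sketch: feature maps -> image tokens -> transformer
    encoder/decoder -> predicted text boxes."""

    def __init__(self, backbone, d_model=256, num_text_queries=100,
                 num_encoder_layers=6, num_decoder_layers=6, num_heads=8):
        super().__init__()
        self.backbone = backbone                        # any CNN feature extractor (assumed)
        self.input_proj = nn.Conv2d(backbone.out_channels, d_model, kernel_size=1)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, num_heads, batch_first=True),
            num_encoder_layers)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, num_heads, batch_first=True),
            num_decoder_layers)
        self.text_queries = nn.Embedding(num_text_queries, d_model)  # learned text queries
        self.box_head = nn.Linear(d_model, 4)           # (cx, cy, w, h) per predicted text box

    def forward(self, images):
        feature_maps = self.input_proj(self.backbone(images))     # (B, d_model, H, W)
        B, C, H, W = feature_maps.shape
        # Sample the feature maps and concatenate into a sequence of image tokens.
        image_tokens = feature_maps.flatten(2).transpose(1, 2)    # (B, H*W, d_model)
        token_queries = self.encoder(image_tokens)                # encoder output: set of token queries
        text_q = self.text_queries.weight.unsqueeze(0).expand(B, -1, -1)
        decoded = self.decoder(text_q, token_queries)             # text queries attend to token queries
        return self.box_head(decoded).sigmoid()                   # normalized predicted text boxes
```

In this sketch the learned embeddings play the role of the set of text queries, and the sigmoid-normalized output of the box head stands in for the set of predicted text boxes.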

[0018] In some examples, generating the encoder output may further include generating, via the first encoder layer, a plurality of query metrics, a plurality of key metrics, and a plurality of value metrics from the plurality of image tokens, wherein the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics are vectors generated by respective linear transforms, and applying, via the first encoder layer, self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics. Applying self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics may further include dividing, at the first encoder layer, the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics respectively into an M-number of pluralities of query metric sub-vectors, an M-number of pluralities of key metric sub-vectors, and an M-number of pluralities of value metric sub-vectors, wherein M is an integer corresponding to a number of attention heads in a first multi-headed attention block of the first encoder layer, and applying, via a first attention head of a first multi-headed attention block of the first encoder layer, self-attention to a first plurality of query metric sub-vectors, a first plurality of key metric sub-vectors, and a first plurality of value metric sub-vectors.
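
The following is a minimal sketch, assuming standard scaled dot-product attention, of how the query, key, and value vectors may be generated by linear transforms and divided into M sub-vectors for M attention heads; the function and argument names are illustrative only.

```python
import torch.nn.functional as F


def multi_head_self_attention(tokens, w_q, w_k, w_v, num_heads):
    """tokens: (B, N, d_model); w_q, w_k, w_v: nn.Linear layers acting as the
    respective linear transforms; num_heads: M, the number of attention heads."""
    B, N, d_model = tokens.shape
    d_head = d_model // num_heads
    # Linear transforms produce the query, key, and value vectors ("metrics").
    q, k, v = w_q(tokens), w_k(tokens), w_v(tokens)
    # Divide each into M sub-vectors, one per attention head.
    q = q.view(B, N, num_heads, d_head).transpose(1, 2)       # (B, M, N, d_head)
    k = k.view(B, N, num_heads, d_head).transpose(1, 2)
    v = v.view(B, N, num_heads, d_head).transpose(1, 2)
    # Score matrix: attention of each image token relative to all other tokens.
    scores = q @ k.transpose(-2, -1) / d_head ** 0.5          # (B, M, N, N)
    attn = F.softmax(scores, dim=-1)
    # Apply the attention weights to the value sub-vectors and re-concatenate heads.
    return (attn @ v).transpose(1, 2).reshape(B, N, d_model)
```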

[0019] In some examples, generating the encoder output may further include applying, via one or more subsequent encoder layers, self-attention to a preceding output of a preceding encoder layer. Applying self-attention to the outputs of the preceding encoder layer may include generating, via a subsequent encoder layer of the one or more subsequent encoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding encoder layer, and applying, via the subsequent encoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors.
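
A short sketch of this layer stacking is shown below, under the assumption that `encoder_layers` is a list of callable layers (e.g., `nn.TransformerEncoderLayer` modules) and `image_tokens` is the encoder input from the preceding paragraphs; each subsequent layer derives its query, key, and value vectors from the preceding layer's output.

```python
# Each subsequent encoder layer re-derives query/key/value vectors from the
# preceding layer's output (encoder_layers and image_tokens are assumed inputs).
x = image_tokens                      # encoder input: the sequence of image tokens
for layer in encoder_layers:
    x = layer(x)                      # self-attention over the preceding layer's output
token_queries = x                     # final encoder output: the set of token queries
```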

[0020] In further examples, generating the decoder output may further include applying, via one or more subsequent decoder layers, self-attention to a preceding output of a preceding decoder layer. Applying self-attention to the outputs of the preceding decoder layer may further include generating, via a subsequent decoder layer of the one or more subsequent decoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding decoder layer, and applying, via the subsequent decoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors.

[0021] In further examples, applying self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics may further include assigning, based on a deformable filter, a respective set of key metrics of the plurality of key metrics comprising a fixed number of key metrics to each query metric of the plurality of query metrics, and applying self-attention may further include producing a score matrix between each query metric and its respective set of key metrics.
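
One possible realization of such a deformable filter is sketched below: each query metric is assigned a fixed number of key metrics sampled from the feature map, and the score matrix is formed only between each query and its own small key set. The offset-prediction layer, the offset scaling, and the use of `grid_sample` are assumptions made for illustration, not details of the disclosure.

```python
import torch
import torch.nn.functional as F


def deformable_encoder_attention(q, feat, offset_net, num_keys=4):
    """q: (B, N, d) query metrics, one per feature-map location (N == H*W);
    feat: (B, d, H, W) feature map; offset_net: nn.Linear(d, 2*num_keys)
    predicting sampling offsets (an assumed component)."""
    B, N, d = q.shape
    _, _, H, W = feat.shape
    # Reference point of each query on the feature map, in [-1, 1] grid coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H, device=feat.device),
                            torch.linspace(-1, 1, W, device=feat.device),
                            indexing="ij")
    ref = torch.stack([xs, ys], dim=-1).view(1, N, 1, 2).expand(B, -1, -1, -1)
    # A fixed number of sampling offsets per query -> its assigned key metrics.
    offsets = offset_net(q).view(B, N, num_keys, 2).tanh() * 0.1
    locs = (ref + offsets).clamp(-1, 1)                       # (B, N, num_keys, 2)
    keys = F.grid_sample(feat, locs, align_corners=True)      # (B, d, N, num_keys)
    keys = keys.permute(0, 2, 3, 1)                           # (B, N, num_keys, d)
    # Score matrix only between each query and its own fixed set of keys.
    scores = (q.unsqueeze(2) * keys).sum(-1) / d ** 0.5       # (B, N, num_keys)
    attn = scores.softmax(dim=-1)
    # Weighted sum over the assigned keys (which here double as values).
    return (attn.unsqueeze(-1) * keys).sum(dim=2)             # (B, N, d)
```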

[0022] In yet further examples, generating the decoder output further includes generating, via the first decoder layer, a plurality of query queries, a plurality of key queries from the set of token queries, and a plurality of value queries from a set of text queries, wherein the plurality of query queries, the plurality of key queries, and the plurality of value queries are vectors generated by respective linear transforms, and applying, via the first decoder layer, self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries. Applying self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries may further include determining, based on a deformable filter, a set of value queries of the plurality of value queries comprising a fixed number of value queries, wherein each value query of the set of value queries is selected to correspond to locations of the one or more feature maps exhibiting prominent features over other pixels. Applying self-attention may further include multiplying the attention weights, generated based on the plurality of query queries and the plurality of key queries, with the set of value queries.
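
The decoder-side step may be sketched as follows, under the simplifying assumptions that the text queries provide the query queries, the token queries provide the key and value projections, and the deformable selection of value queries is represented by a precomputed index tensor of salient feature-map locations; the exact query/key/value assignment and selection mechanism of the disclosure may differ.

```python
import torch
import torch.nn.functional as F


def decoder_deformable_cross_attention(text_q, token_q, w_q, w_k, w_v, salient_idx):
    """text_q: (B, T, d) text queries; token_q: (B, N, d) token queries from the
    encoder; w_q, w_k, w_v: nn.Linear projections (assumed); salient_idx: (B, K)
    long-tensor indices of locations with prominent features (a stand-in for the
    deformable selection of value queries)."""
    q = w_q(text_q)                                   # query queries
    k = w_k(token_q)                                  # key queries from the token queries
    v = w_v(token_q)                                  # candidate value queries
    d = q.size(-1)
    idx = salient_idx.unsqueeze(-1).expand(-1, -1, d)
    # Keep only a fixed number of value queries (and matching keys) at salient locations.
    k_sel = torch.gather(k, 1, idx)                   # (B, K, d)
    v_sel = torch.gather(v, 1, idx)                   # (B, K, d)
    # Attention weights from the query queries and key queries ...
    attn = F.softmax(q @ k_sel.transpose(-2, -1) / d ** 0.5, dim=-1)   # (B, T, K)
    # ... multiplied with the selected set of value queries.
    return attn @ v_sel                               # (B, T, d)
```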

[0023] In some examples, applying self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries may further include dividing, at the first decoder layer, the plurality of query queries, the plurality of key queries, and the plurality of value queries respectively into an M-number of pluralities of query query sub-vectors, an M-number of pluralities of key query sub-vectors, and an M-number of pluralities of value query sub-vectors, wherein M is an integer corresponding to a number of attention heads in a first multi-headed attention block of the first decoder layer, and applying, via a first attention head of a first multi-headed attention block of the first decoder layer, self-attention to a first plurality of query query sub-vectors, a first plurality of key query sub-vectors, and a first plurality of value query sub-vectors.

[0024] In some examples, the method may further include enforcing, via the computer system, unique matching between each of one or more ground-truth text boxes of the input image and the set of predicted text boxes of the input image, based on a cost-based matching algorithm, wherein each ground-truth text box is uniquely matched to one corresponding predicted text box of the set of predicted text boxes. In yet further examples, the method may include determining, via the computer system, a bounding box loss between each ground-truth text box and the uniquely matched predicted text box, and optimizing, via the computer system, the bounding box loss by modification of at least one of the backbone network, transformer encoder, or transformer decoder.
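
As a sketch of the cost-based unique matching and bounding-box loss described above, the example below uses the Hungarian algorithm (`scipy.optimize.linear_sum_assignment`) with a simple L1 box cost; the choice of cost, the L1 loss, and the box format are assumptions standing in for whatever cost-based matcher and loss a particular training pipeline uses.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def match_and_box_loss(pred_boxes, gt_boxes):
    """pred_boxes: (P, 4) predicted text boxes; gt_boxes: (G, 4) ground-truth
    text boxes, with G <= P. L1 matching cost and L1 box loss for illustration."""
    # Pairwise cost between every ground-truth box and every predicted box.
    cost = torch.cdist(gt_boxes, pred_boxes, p=1)                 # (G, P)
    # The Hungarian algorithm enforces a unique one-to-one matching.
    gt_idx, pred_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    gt_idx = torch.as_tensor(gt_idx, dtype=torch.long)
    pred_idx = torch.as_tensor(pred_idx, dtype=torch.long)
    # Bounding-box loss between each ground-truth box and its matched prediction.
    loss = F.l1_loss(pred_boxes[pred_idx], gt_boxes[gt_idx])
    return loss, list(zip(gt_idx.tolist(), pred_idx.tolist()))
```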

[0025] In some embodiments, an apparatus for transformer-based scene text detection is provided. The apparatus may include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to perform various functions. The set of instructions may be executed by the processor to generate a plurality of image tokens from one or more feature maps, wherein the plurality of image tokens is concatenated into a sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of an input image, and obtain, via a transformer encoder, the plurality of image tokens as an encoder input vector to a first encoder layer. The set of instructions may further be executed by the processor to generate, via a transformer encoder, a set of token queries, wherein the set of token queries are an encoder output of the transformer encoder comprising the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens. The set of instructions may further be executed by the processor to obtain, via a transformer decoder, the set of token queries and a set of text queries as a decoder input vector to a first decoder layer, wherein the set of text queries is a set of one or more vectors representing coordinates of the input image defining one or more respective text boxes, and generate, via the transformer decoder, a set of predicted text boxes of the input image based on the set of token queries, wherein the set of predicted text boxes of an image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image.

[0026] In some examples, generating the encoder output may further include generating, via the first encoder layer, a plurality of query metrics, a plurality of key metrics, and a plurality of value metrics from the plurality of image tokens, wherein the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics are vectors generated by respective linear transforms, and applying, via the first encoder layer, self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics. Generating the decoder output may further include generating, via the first decoder layer, a plurality of query queries, a plurality of key queries from the set of token queries, and a plurality of value queries from the set of text queries, wherein the plurality of query queries, the plurality of key queries, and the plurality of value queries are vectors generated by respective linear transforms, and applying, via the first decoder layer, self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries.

[0027] In further examples, the set of instructions may further be executable by the processor to apply, via one or more subsequent encoder layers, self-attention to a preceding output of a preceding encoder layer. Applying self-attention to the outputs of the preceding encoder layer may include generating, via a subsequent encoder layer of the one or more subsequent encoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding encoder layer, and applying, via the subsequent encoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors. The set of instructions may further be executable by the processor to apply, via one or more subsequent decoder layers, self-attention to a preceding output of a preceding decoder layer. Applying self-attention to the outputs of the preceding decoder layer may further include generating, via a subsequent decoder layer of the one or more subsequent decoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding decoder layer, and applying, via the subsequent decoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors.

[0028] In yet further examples, the set of instructions may further be executable by the processor to assign, based on a deformable filter, a respective set of key metrics of the plurality of key metrics comprising a fixed number of key metrics to each query metric of the plurality of query metrics. Applying self-attention may further include producing a score matrix between each query metric and its respective set of key metrics.

[0029] In yet further examples, the set of instructions may further be executable by the processor to determine, based on a deformable filter, a set of value queries of the plurality of value queries comprising a fixed number of value queries, wherein each value query of the set of value queries is selected to correspond to locations of the one or more feature maps exhibiting prominent features over other pixels. Applying self-attention may further include multiplying the attention weights, generated based on the plurality of query queries and the plurality of key queries, with the set of value queries.

[0030] In yet further examples, the set of instructions may further be executable by the processor to enforce, via the computer system, unique matching between each of one or more ground-truth text boxes of the input image and the set of predicted text boxes of the input image, based on a cost-based matching algorithm, wherein each ground-truth text box is uniquely matched to one corresponding predicted text box of the set of predicted text boxes.

[0031] In further embodiments, a system for transformer-based scene text detection is provided. The system may include a transformer network, which includes a transformer encoder and a transformer decoder, the transformer encoder including one or more encoder layers, and the transformer decoder including one or more decoder layers, and each of the one or more encoder layers and one or more decoder layers including a respective multi-head attention block, and a scene text detection subsystem. The scene text detection subsystem may further include a processor, and a non-transitory computer readable medium in communication with the processor, the non-transitory computer readable medium having encoded thereon a set of instructions executable by the processor to generate a plurality of image tokens from one or more feature maps of an input image, wherein the plurality of image tokens is concatenated into a sequence of image tokens, wherein each image token of the plurality of image tokens is a sampled vector representation of a respective region of the one or more feature maps, wherein each image token of the plurality of image tokens corresponds to a respective textual feature of the input image. The instructions may further be executed by the processor to generate, via each of the one or more encoder layers, a plurality of query metrics, a plurality of key metrics, and a plurality of value metrics from the plurality of image tokens, wherein the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics are vectors generated by respective linear transforms. The instructions may further be executed by the processor to generate, via each of the one or more encoder layers, a set of token queries, wherein the set of token queries are an output of each encoder layer of the one or more encoder layers, wherein the set of token queries are the plurality of image tokens encoded with information quantifying an attention of a respective textual feature of a respective image token of the sequence of image tokens relative to all other respective textual features of all other image tokens in the sequence of image tokens, wherein generating the set of token queries comprises applying self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics, and generate, via each of the one or more decoder layers, a plurality of query queries, a plurality of key queries from the set of token queries, and a plurality of value queries from a set of text queries, wherein the plurality of query queries, the plurality of key queries, and the plurality of value queries are vectors generated by respective linear transforms. 
The set of instructions may further be executed by the processor to detect the location of text in the input image, wherein detecting the location of the text in the input image comprises generating, via each of the one or more decoder layers, a set of predicted text boxes of the input image based on the set of token queries and the set of text queries, wherein the set of text queries is a set of one or more vectors representing coordinates of the input image defining one or more respective text boxes, wherein the set of predicted text boxes of the input image are a decoder output of the transformer decoder comprising a set of one or more vectors representing the coordinates of the input image defining one or more predicted text boxes, wherein the one or more predicted text boxes are text boxes predicted to contain text in the input image, wherein generating the set of predicted text boxes comprises applying self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries.

[0032] In some examples, applying, via each of the one or more encoder layers, self-attention to the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics further comprises dividing, at each of the one or more encoder layers, the plurality of query metrics, the plurality of key metrics, and the plurality of value metrics respectively into an M-number of pluralities of query metric sub-vectors, an M-number of pluralities of key metric sub-vectors, and an M-number of pluralities of value metric sub-vectors, wherein M is an integer corresponding to a number of attention heads in a first multi-headed attention block of each respective encoder layer of the one or more encoder layers, and applying, via a first attention head of a first multi-headed attention block of each respective encoder layer, self-attention to a first plurality of query metric sub-vectors, a first plurality of key metric sub-vectors, and a first plurality of value metric sub-vectors. In some examples, applying, via each of the one or more decoder layers, self-attention to the plurality of query queries, the plurality of key queries, and the plurality of value queries further includes dividing, at each decoder layer of the one or more decoder layers, the plurality of query queries, the plurality of key queries, and the plurality of value queries respectively into an M-number of pluralities of query query sub-vectors, an M-number of pluralities of key query sub-vectors, and an M-number of pluralities of value query sub-vectors, wherein M corresponds to the number of attention heads in the first multi-headed attention block of the first decoder layer, and applying, via a first attention head of a first multi-headed attention block of each respective decoder layer of the one or more decoder layers, self-attention to a first plurality of query query sub-vectors, a first plurality of key query sub-vectors, and a first plurality of value query sub-vectors.

[0033] In some examples, the set of instructions may further be executable by the processor to apply, via one or more subsequent encoder layers, self-attention to a preceding output of a preceding encoder layer, wherein applying self-attention to the outputs of the preceding encoder layer includes generating, via a subsequent encoder layer of the one or more subsequent encoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding encoder layer, applying, via the subsequent encoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors generated from the preceding output of the preceding encoder layer, and apply, via one or more subsequent decoder layers, self-attention to a preceding output of a preceding decoder layer. Applying self-attention to the outputs of the preceding decoder layer may further include generating, via a subsequent decoder layer of the one or more subsequent decoder layers, a plurality of query vectors, a plurality of key vectors, and a plurality of value vectors from the preceding output of the preceding decoder layer, and applying, via the subsequent decoder layer, self-attention to the plurality of query vectors, the plurality of key vectors, and the plurality of value vectors generated from the preceding output of the preceding decoder layer.

[0034] In some examples, the set of instructions may further be executable by the processor to assign, based on a deformable encoder filter, a respective set of key metrics of the plurality of key metrics comprising a fixed number of key metrics to each query metric of the plurality of query metrics, wherein applying self-attention further comprises producing a score matrix between each query metric and its respective set of key metrics; and determine, based on a deformable decoder filter, a set of value queries of the plurality of value queries comprising a fixed number of value queries, wherein each value query of the set of value queries is selected to correspond to locations of the one or more feature maps exhibiting prominent features over other pixels, wherein applying self-attention further comprises multiplying the attention weights, generated based on the plurality of query queries and the plurality of key queries, with the set of value queries.

[0035] In yet further examples, the set of instructions may further be executable by the processor to enforce, via the computer system, unique matching between each of one or more ground-truth text boxes of the input image and the set of predicted text boxes of the input image, based on a cost-based matching algorithm, wherein each ground-truth text box is uniquely matched to one corresponding predicted text box of the set of predicted text boxes.

[0036] In the following description, for the purposes of explanation, numerous details are set forth to provide a thorough understanding of the described embodiments. It will be apparent to one skilled in the art, however, that other embodiments may be practiced without some of these details. In other instances, structures and devices are shown in block diagram form. Several embodiments are described herein, and while various features are ascribed to different embodiments, it should be appreciated that the features described with respect to one embodiment may be incorporated with other embodiments as well. By the same token, however, no single feature or features of any described embodiment should be considered essential to every embodiment of the invention, as other embodiments of the invention may omit such features.

[0037] Unless otherwise indicated, all numbers used herein to express quantities, dimensions, and so forth used should be understood as being modified in all instances by the term "about." In this application, the use of the singular includes the plural unless specifically stated otherwise, and use of the terms "and" and "or" means "and/or" unless otherwise indicated. Moreover, the use of the term "including," as well as other forms, such as "includes" and "included," should be considered non-exclusive. Also, terms such as "element" or "component" encompass both elements and components comprising one unit and elements and components that comprise more than one unit, unless specifically stated otherwise.

[0038] The various embodiments include, without limitation, methods, systems, apparatuses, and/or software products. Merely by way of example, a method might comprise one or more procedures, any or all of which may be executed by a computer system. Correspondingly, an embodiment might provide a computer system configured with instructions to perform one or more procedures in accordance with methods provided by various other embodiments. Similarly, a computer program might comprise a set of instructions that are executable by a computer system (and/or a processor therein) to perform such operations. In many cases, such software programs are encoded on physical, tangible, and/or non-transitory computer readable media (such as, to name but a few examples, optical media, magnetic media, and/or the like).

[0039] Various embodiments described herein, embodying software products and computer-performed methods, represent tangible, concrete improvements to existing technological areas, including, without limitation, machine-based text detection. Specifically, implementations of various embodiments provide for a way to detect text in a scene using a computer vision (CV) machine learning (ML) model, and specifically a transformer-based model. Conventional approaches to scene text detection rely on CNNs, which are inefficient and not generalized. CNN models operate on many inductive biases, and thus do not generalize well to different data sets. Specifically, CNN models are trained on specific data sets, in which anchor boxes are manually designed, and post-processing techniques, such as non-maximum suppression, are typically used to reduce false positive text detection. Moreover, statistics and training schedules are typically adapted for each specific data set. Thus, the training of CNNs is computationally inefficient, and the generalization of CNN models to other data sets is limited and inaccurate. The transformer-based approach, set forth below, allows for a more robust scene text detection solution, which requires fewer priors and inductive biases and less training data, and provides drastically improved generalization across different data sets.

[0040] To the extent any abstract concepts are present in the various embodiments, those concepts can be implemented as described herein by devices, software, systems, and methods that involve novel functionality (e.g., steps or operations), such as transformer-based scene text detection.

[0041] Fig. 1 is a schematic block diagram of a system 100 for transformer-based scene text detection. The system 100 includes data pre-processing logic 105, feature extraction network 110, token generation logic 115, queries 130, and transformer network 150. The transformer network 150 includes transformer encoder 120, and transformer decoder 140. The transformer encoder 120 may further include one or more encoder layers 125a-125n (collectively "encoder layers 125"), and the transformer decoder 140 may include one or more decoder layers 145a-145n (collectively "decoder layers 145"). It should be noted that the various components of the system 100 are schematically illustrated in Fig. 1, and that modifications to the various components and other arrangements of system 100 may be possible and in accordance with the various embodiments.

[0042] In various embodiments, the data pre-processing logic 105 may be coupled to the feature extraction network 110, which may be coupled to the token generation logic 115, which is in turn coupled to the transformer network 150. The transformer network 150 may include transformer encoder 120, and transformer decoder 140. The token generation logic 115 may be coupled to the transformer encoder 120. The transformer encoder 120 may further include one or more encoder layers 125a-125n. The transformer encoder 120 may be coupled to the transformer decoder 140, such that the output of the encoder layers 125 may feed to the input of the transformer decoder 140. Queries 130 may also be coupled to the transformer decoder 140. The transformer decoder 140 may further include one or more decoder layers 145a-145n.

[0043] In various embodiments, the data pre-processing logic 105, feature extraction network 110, token generation logic 115, and transformer network 150 may be implemented as hardware, and/or software running on one or more computer systems of the system 100. Accordingly, the computer systems may include one or more physical machines or one or more virtual machines (VM) configured to implement the data pre-processing logic 105, feature extraction network 110, token generation logic 115, and/or transformer network 150. The one or more computer systems may be arranged in a distributed (or centralized) architecture, such as in a cloud platform. In further embodiments, the data pre-processing logic 105, feature extraction network 110, and/or transformer network 150 may be implemented locally on a user device or computer system.

[0044] In various embodiments, the data pre-processing logic 105 may be configured to pre-process input data, such as training data, provided to the feature extraction network 110. Input data may include an image or a set of images. Thus, in some examples, the data pre-processing logic 105 may be configured to pre-process the images or set of images of the input data. In some examples, pre-processing may include, without limitation, randomly flipping (e.g., horizontally or vertically) and/or rotating images. In some examples, image rotations may occur randomly in the range of -10 degrees to 10 degrees of rotation. In further examples, image rotation may include rotations greater than 10 degrees in either direction, for example, random rotations in the range of -45 degrees to 45 degrees, or rotations of less than 10 degrees in either direction, for example, random rotations in the range of -5 to 5 degrees, etc. In further examples, pre-processing of input data may further include random rescaling of images and/or cropping of images. In some examples, images may be rescaled randomly to a scale of 0.5, 2.0, or unscaled (e.g., a scale of 1.0) for training purposes. Thus, the model may be trained to identify text of varying sizes, orientation, spacing, and density. In some examples, the rescaled images may then be cropped to one or more resolutions, or into samples of a given resolution. For example, rescaled images may be cropped into 1280x1080 pixel resolution samples. In some examples, one or more 1280x1080 samples may be taken from a single image. The pre-processed input data from the pre-processing logic 105 may be fed into the feature extraction network 110.
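
By way of illustration only, the following is a minimal sketch of such a pre-processing pipeline, assuming a PyTorch/torchvision environment; the transform names, probabilities, rotation range, scales, and crop size shown are illustrative assumptions rather than parameters taken from the embodiments.

import torchvision.transforms as T

# Hypothetical training-time pre-processing: random flips, small random rotation,
# random rescale to 0.5x / 1.0x / 2.0x, and cropping to a fixed-resolution sample.
preprocess = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.RandomVerticalFlip(p=0.5),
    T.RandomRotation(degrees=10),  # random rotation in the range of -10 to 10 degrees
    T.RandomChoice([
        T.Lambda(lambda img, s=s: img.resize(
            (int(img.size[0] * s), int(img.size[1] * s))))
        for s in (0.5, 1.0, 2.0)
    ]),
    T.RandomCrop((1280, 1080), pad_if_needed=True),  # e.g., 1280x1080 samples
    T.ToTensor(),
])

# Usage: sample = preprocess(pil_image) for a PIL input image.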

[0045] In various embodiments, the feature extraction network 110, also referred to as a backbone network, may be configured to generate one or more feature maps of the input data. Accordingly, feature extraction network 110 may include various types of convolutional neural networks (CNN), such as residual neural networks (ResNet), recurrent neural networks (RNN), etc., configured to extract and/or identify features in the input data, and further to generate feature maps. Feature extraction network 110 may include, for example, a feature pyramid network, configured to generate a feature pyramid from the one or more images of the input data. In some embodiments, the feature extraction network 110 may be configured to utilize a text filter or character filter to generate a feature map of textual features. In some examples, the feature extraction network 110 may be a ResNet-50. In other embodiments, deeper backbone networks may be adopted, such as a ResNet-101, which may further improve the accuracy of scene text detection. Using a ResNet-50 backbone, a feature pyramid may be extracted for each image of the input data, which then generates 5 stages of feature maps C1, C2, C3, C4, and C5 for each image. In some examples, the downsampling rate (e.g., reduction in sampling rate) may be given as 2^i for a Ci feature map. Thus, feature map C1 may be downsampled by a factor of 2, feature map C2 may be downsampled by a factor of 4, and so forth.
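
As a sketch only, and assuming a standard torchvision ResNet-50 whose stage layout matches the downsampling factors stated above, the multi-stage feature maps might be obtained as follows; the stage-to-Ci mapping and the use of torchvision are assumptions for illustration.

import torch
import torchvision

# Illustrative extraction of multi-stage feature maps C1-C5 from a ResNet-50
# backbone (torchvision >= 0.13 signature; older versions use pretrained=False).
resnet = torchvision.models.resnet50(weights=None)

def extract_feature_maps(images):
    # images: (B, 3, H, W); returns feature maps C1..C5.
    x = resnet.conv1(images)
    x = resnet.bn1(x)
    x = resnet.relu(x)
    c1 = x                       # downsampled by a factor of 2
    x = resnet.maxpool(x)
    c2 = resnet.layer1(x)        # downsampled by a factor of 4
    c3 = resnet.layer2(c2)       # downsampled by a factor of 8
    c4 = resnet.layer3(c3)       # downsampled by a factor of 16
    c5 = resnet.layer4(c4)       # downsampled by a factor of 32
    return c1, c2, c3, c4, c5

feature_maps = extract_feature_maps(torch.randn(1, 3, 1280, 1080))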

[0046] Token generation logic 115 may then be configured to generate tokens from the feature maps (e.g., feature pyramids). In some examples, token generation logic 115 may be included as part of the feature extraction network 110. In other examples, the token generation logic may be standalone logic or part of the transformer network, configured to accept the feature maps as input from the feature extraction network. In some embodiments, token generation logic 115 may be configured to generate one or more image tokens from the one or more feature maps. For example, image tokens may refer to tokenized data contained within images. In various embodiments, tokenization refers to a process by which data, in this case image data, is represented numerically for processing by a neural network. In some examples, an image token may represent the data in an image (or a feature map of an image) in the form of an array, the elements of the array corresponding to the pixels (or subsets of pixels) of the image and/or, in the following examples, a feature map of an image. In some examples, to generate image tokens, the image tokens may be sampled from feature maps of images. For example, feature maps from C2, C3, C4, and C5 may be selected for sampling of image tokens. To generate image tokens, for each feature map with dimensions C x H x W, where C is the number of channels (e.g., number of feature maps), H is the height and W is the width of the feature maps, the feature map is flattened to be a vector with dimensions C x HW, and image tokens may be sampled at a token sampling rate. After sampling from each feature map, the image tokens may be concatenated into a sequence of image tokens with dimensions C x N, where N is the total number of tokens.
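
A minimal sketch of this token generation step is shown below, assuming the feature maps have already been projected to a common channel dimension C and that tokens are sampled at a uniform stride; both the common-channel projection and the stride-based sampling are illustrative assumptions.

import torch

def generate_image_tokens(feature_maps, sampling_rate=4):
    # Flatten each C x H x W feature map to C x (H*W), sample image tokens at a
    # token sampling rate, and concatenate the samples into a C x N sequence.
    tokens = []
    for fmap in feature_maps:                    # each fmap: (C, H, W)
        c, h, w = fmap.shape
        flat = fmap.reshape(c, h * w)            # (C, H*W)
        tokens.append(flat[:, ::sampling_rate])  # keep every sampling_rate-th position
    return torch.cat(tokens, dim=1)              # (C, N), N = total number of tokens

# Example with C2-C5 maps assumed projected to a common channel dimension of 256.
maps = [torch.randn(256, 80, 68), torch.randn(256, 40, 34),
        torch.randn(256, 20, 17), torch.randn(256, 10, 9)]
image_tokens = generate_image_tokens(maps)       # shape (256, N)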

[0047] The image tokens may then be provided as inputs to the transformer network 150, and specifically to the transformer encoder 120. In some embodiments, positional encoding may be added to the one or more image tokens to identify a location of a given image token in the sequence of image tokens. In various embodiments, the transformer encoder 120 may include one or more encoder layers 125a-125n employing respective multi-headed attention blocks. In some examples, the multi-headed attention blocks may employ a plurality of attention heads to apply self-attention to the respective Q, K, V component vectors of the one or more image tokens. In some examples, the transformer encoder 120 may be a deformable transformer encoder. The transformer encoder 120 and encoder layers 125 are described in greater detail below with respect to Fig. 2. The transformer encoder 120 may, thus, be configured to encode each of the image tokens with attention weight information to produce an encoded vector (e.g., sequence of encoded image tokens), and provide the encoded vector to the transformer decoder 140. Specifically, the encoded vector may be encoded with information regarding how an image token attends to all other image tokens in the sequence.

[0048] In various embodiments, the transformer decoder 140 may include one or more decoder layers 145a-145n. Each of the decoder layers 145 may employ respective multi-headed attention blocks. As in the transformer encoder 120, the multi-headed attention blocks of the decoder layers 145 may include deformable attention heads. Thus, the transformer decoder 140 may similarly be a deformable transformer decoder. The transformer decoder 140 and decoder layers 145 are described in greater detail below with respect to Fig. 2.

[0049] In various embodiments, the outputs of the transformer encoder 120 may be fed to the transformer decoder 140 as a set of token queries representing image features. Specifically, the output of the transformer encoder 120 (e.g., a sequence of encoded image tokens) may be linearly transformed to generate query (Q) and key (K) queries to the multi-headed attention blocks of the decoder layers 145. The transformer decoder 140 may further be configured to receive queries 130. Queries 130 may include a set of one or more text queries, which may be linearly transformed to generate one or more respective V queries. In various embodiments, the set of text queries may be generated as a set of vectors (x1, y1, x2, y2), which represent coordinates of a predicted text box generated with a Gaussian distribution. Coordinates (x1, y1) may correspond to the coordinates of an upper left corner of a prediction box (e.g., a text box), and coordinates (x2, y2) may correspond to the coordinates of a lower right corner of the prediction box. Self-attention may then be applied to the generated Q, K, V queries, as will be described in greater detail below with respect to Fig. 2.
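
The following sketch illustrates one way the text queries and the linear transforms described above might be set up; the number of queries, the model dimension, the token count, and the use of nn.Linear layers are assumptions for illustration only.

import torch
import torch.nn as nn

num_queries, d_model = 100, 256      # assumed values

# Text queries: Gaussian-initialized (x1, y1, x2, y2) corner coordinates per query.
text_queries = torch.randn(num_queries, 4)

# Linear transforms: token queries (encoder output) -> Q and K queries,
# text queries -> V queries.
to_q = nn.Linear(d_model, d_model)
to_k = nn.Linear(d_model, d_model)
to_v = nn.Linear(4, d_model)

token_queries = torch.randn(500, d_model)   # e.g., 500 encoded image tokens
Q, K = to_q(token_queries), to_k(token_queries)
V = to_v(text_queries)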

[0050] Thus, in various embodiments, the output of the transformer decoder 140 may be a set of predicted text boxes of an input image. The output may be represented as a matrix having dimensions (batch, 8). Thus, the matrix may be considered a "batch" number of 8-length vectors. Each field of the vector (x1, y1, x2, y2, x3, y3, x4, y4) may correspond to 8 coordinates (e.g., 4 sets of (x, y) coordinates), the 4 sets of coordinates corresponding to the four corners of the predicted text box (e.g., a bounding box), and where batch is the total number of predicted text boxes (e.g., total number of text queries).

[0051] In various embodiments, during training, a loss function may be determined to enforce one-to-one matching between the outputs of text queries with ground-truth text boxes. In some embodiments, a Hungarian matching algorithm may be employed to uniquely match predicted text boxes with ground-truth text boxes (e.g., bipartite matching). In further embodiments, bounding boxes may further be refined by optimizing for bounding box loss (e.g., intersection over union (IoU) loss). In one example, loss may be computed as:

L_Hungarian(y, ŷ) = Σ_{i=1}^{N} [ L_cls(c_i, ĉ_σ(i)) + L_bbox(b̂_σ(i), b_i) ]     Eq. (1)

where σ is the set of matched text queries, and the classification loss (L_cls) is a negative log-likelihood. L_bbox may be a bounding box loss, which contains both L1 loss and IoU loss. The Hungarian loss function makes the training process end-to-end with a combination of CNN and transformer-based models.
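
As a sketch of the bipartite matching step, assuming a cost matrix that already combines the classification and bounding box terms, the unique assignment could be computed with the Hungarian algorithm as implemented in SciPy:

import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(cost_matrix):
    # cost_matrix[i, j]: matching cost between ground-truth box i and predicted box j.
    # Returns a one-to-one assignment minimizing the total cost (bipartite matching).
    gt_idx, pred_idx = linear_sum_assignment(cost_matrix)
    return list(zip(gt_idx, pred_idx))

# Example: 3 ground-truth boxes and 5 predicted boxes with random illustrative costs.
matches = hungarian_match(np.random.rand(3, 5))   # e.g., [(0, 2), (1, 0), (2, 4)]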

[0052] Fig. 2 is a schematic block diagram of transformer model architecture 200 for scene text detection, in accordance with various embodiments. The architecture 200 includes feature maps 210, transformer encoder 215, and transformer decoder 245. The transformer encoder 215 may further include one or more encoder layers 220a-220n (collectively "encoder layers 220"), and the transformer decoder 245 may further include one or more decoder layers 250a-250n (collectively "decoder layers 250"). The encoder layer 220a may include multi-headed attention block 225, addition and normalization layer 230, feed forward network 235, and addition and normalization layer 240. Similarly, the decoder layer 250a may include multi-head attention block 255, addition and normalization layer 260, feed forward network 265, and addition and normalization layer 270. It should be noted that the various components of the architecture 200 are schematically illustrated in Fig. 2, and that modifications to the various components and other arrangements of architecture 200 may be possible and in accordance with the various embodiments.

[0053] As previously described with respect to Fig. 1, the feature extraction network 205 may be configured to produce one or more feature maps 210. One or more image tokens may be generated from the one or more feature maps 210. The image tokens are then concatenated as a sequence of image tokens, and fed to the transformer encoder 215 as inputs. Accordingly, the multi-head attention block 225 may include a plurality of attention heads, for example, an M-number of attention heads.

[0054] In various embodiments, a set of query (Q), key (K), and value (V) vectors may then be generated from the sequence of image tokens. The Q, K, and V vectors generated from the image tokens may be referred to as Q, K, and V metrics. In some examples, the Q, K, and V metrics may be divided into M-sets of Q, K, V metrics corresponding to each of the respective attention heads. In some embodiments, generating Q, K, and V metrics may include applying a respective linear transformation to each image token (e.g., image token vector) of the sequence of image tokens.

[0055] Thus, after feeding the image tokens through a linear layer, Q, K, and V metrics are generated for each image token vector. Once Q, K, and V vectors (e.g., Q, K, V metrics) are generated, self-attention may be applied to each of the Q, K, and V metrics. In some examples, applying self-attention comprises generating a score matrix by dot product multiplication of the Q and K metrics (e.g., a score matrix quantifying the attention of a given image token to the entire image), and scaling of the score matrix to produce attention scores. Scaling, for example, may include dividing the score matrix by the dimension of K (d_k), which in this case is the same as the dimension of Q. In some further examples, scaling may be accomplished by dividing the score matrix by the square root of d_k, as opposed to d_k, where appropriate. A softmax of the attention score may be taken to generate attention weights. The attention weights may then be multiplied with the V metric to generate an output vector (e.g., a sequence of encoded image token vectors).

[0056] For example, self-attention may be applied as set forth by the following formula:

Attention(Q, K, V) = softmax(Q K^T / d_k) V     Eq. (2)

where K^T is the transpose of the K vector.
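
A minimal sketch of the self-attention computation of Eq. (2) is given below; it scales the score matrix by d_k, with the square-root scaling noted above as a drop-in alternative, and the token count and dimension in the example are assumed values.

import torch

def self_attention(Q, K, V, d_k):
    # Score matrix from the dot product of Q and K, scaled, softmaxed into
    # attention weights, and multiplied with V (Eq. (2)).
    scores = Q @ K.transpose(-2, -1) / d_k
    weights = torch.softmax(scores, dim=-1)
    return weights @ V

# Example: a sequence of 500 image tokens with an assumed dimension of 256.
Q = K = V = torch.randn(500, 256)
encoded = self_attention(Q, K, V, d_k=256)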

[0057] In further examples, once the multi-head attention block 225 generates the respective output vectors, the output vectors of each of the attention heads may be concatenated to form a single output vector of the multi-head attention block 225. The output vector may be passed to addition and normalization layer 230, which may add the output vector to the original input vector (e.g., sequence of image tokens) to produce a residual output, which is then normalized. The normalized residual output may then be passed to a feed forward network (FFN) 235. In some embodiments, the FFN 235 may be a fully-connected multilayer perceptron (MLP). In some examples, the output of the FFN 235 may again be passed to addition and normalization layer 240. In some examples, the addition and normalization layer 240 may add the output of the FFN 235 to the normalized residual output, which is then normalized again.
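
For illustration, a single encoder layer of the kind described above might be sketched as follows, assuming PyTorch's nn.MultiheadAttention and layer normalization for the addition and normalization steps; the model dimension, head count, and FFN width are assumed values, not parameters of the embodiments.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # Multi-head self-attention, add & normalize, feed forward network (MLP),
    # then a second add & normalize, mirroring blocks 225, 230, 235, and 240.
    def __init__(self, d_model=256, num_heads=8, ffn_dim=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, ffn_dim), nn.ReLU(),
                                 nn.Linear(ffn_dim, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, tokens):                    # tokens: (batch, N, d_model)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        x = self.norm1(tokens + attn_out)         # add residual, then normalize
        return self.norm2(x + self.ffn(x))        # add FFN output, then normalize

layer = EncoderLayer()
encoded_tokens = layer(torch.randn(1, 500, 256))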

[0058] In some embodiments, the transformer encoder 215 may be a deformable transformer encoder, in which the attention heads of the multi-head attention block 225 may comprise deformable attention heads. Thus, the input and output of the encoder may be multi-scale feature maps of the same resolution. As previously described, the feature maps from stages C2-C5 of the feature extraction network may be used. In some examples, for deformable multi-headed attention, the transformer encoder 215 may be configured to process input feature maps to selectively sample a set of locations that are pre-filtered for having prominent features over other pixels. Thus, the deformable multi-headed attention module may attend only to a set of key sampling points around a reference point. In some examples, a fixed number of keys may be assigned for each query, thus determining attention weights more efficiently.
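
The sketch below is a heavily simplified, single-scale, single-head approximation of this deformable attention idea: each query attends only to a small, fixed number of sampled locations around its reference point rather than to the full feature map. The offset scaling, point count, and projection layers are illustrative assumptions, not the module of the embodiments.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    def __init__(self, d_model=256, num_points=4):
        super().__init__()
        self.num_points = num_points
        self.offset_pred = nn.Linear(d_model, num_points * 2)  # per-query sampling offsets
        self.weight_pred = nn.Linear(d_model, num_points)      # per-query attention weights
        self.value_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, queries, ref_points, value_map):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; value_map: (B, C, H, W)
        b, nq, c = queries.shape
        offsets = self.offset_pred(queries).view(b, nq, self.num_points, 2)
        weights = torch.softmax(self.weight_pred(queries), dim=-1)       # (B, Nq, P)
        # Sampling locations around each reference point, mapped to [-1, 1] for grid_sample.
        locs = (ref_points.unsqueeze(2) + 0.1 * offsets).clamp(0, 1) * 2 - 1
        sampled = F.grid_sample(value_map, locs, align_corners=False)    # (B, C, Nq, P)
        sampled = self.value_proj(sampled.permute(0, 2, 3, 1))           # (B, Nq, P, C)
        return self.out_proj((weights.unsqueeze(-1) * sampled).sum(dim=2))

attn = SimpleDeformableAttention()
out = attn(torch.randn(1, 100, 256), torch.rand(1, 100, 2), torch.randn(1, 256, 40, 34))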

[0059] In some embodiments, the transformer encoder 215 may include one or more encoder layers 220a-220n. In some embodiments, the output vector of the first encoder layer 220a may be provided as a further input to a subsequent encoder layer, and the output of the subsequent encoder layer fed to a subsequent encoder layer until the n-th encoder layer 220n. Accordingly, in some embodiments, the transformer encoder 215 may include 6 consecutive encoder layers 220, each with a respective multi-head attention block, addition and normalization layers, and FFN.

[0060] In some embodiments, the output of the transformer encoder 215 (e.g., the last encoder layer 220n of the transformer encoder 215) may be fed to the transformer decoder 245 (e.g., a first decoder layer 250a of the transformer decoder). Like the transformer encoder 215, the transformer decoder 245 may include one or more decoder layers 250a-250n. A first decoder layer 250a may include a multi-head attention block 255, which may comprise a plurality of attention heads. The multi-head attention block may be fed the output vector from the transformer encoder 215.

[0061] As previously described, the output vector of the transformer encoder 215 may be an input vector, also referred to as token queries. The token queries may, thus, represent image features (e.g., image tokens) as encoded by the transformer encoder. In various embodiments, Q query and K query (vectors) may be generated from the token queries. The transformer decoder 245 may further be configured to receive a set of text queries. As previously described, text queries may be initialized (e.g., generated) based on a Gaussian distribution of coordinates. In some examples, the text queries may respectively be sets of vectors (x1, y1, x2, y2), where coordinates (x1, y1) may correspond to the coordinates of an upper left corner of a prediction box (e.g., a text box), and coordinates (x2, y2) may correspond to the coordinates of a lower right corner of the prediction box. A V query (vector) may be generated based on the text queries. To generate the Q, K, and V vectors (referred to in the context of the transformer decoder 245 as Q, K, and V queries), a respective linear transform may be applied to each of the input vectors (e.g., passing the token queries and text queries through a linear layer).

[0062] Once the Q, K, and V queries are generated, self-attention may be applied, as set forth in Eq. (2) above.

[0063] Thus, the output of the transformer decoder 245 may be a set of predicted text boxes of an input image, in vector form. For example, the output of the transformer decoder 245 may be a matrix of size (batch, 8), where the scalar values of the vector correspond to 8 coordinates of a bounding box, (x1, y1, x2, y2, x3, y3, x4, y4), that is, 4 sets of (x, y) coordinates, the 4 sets of coordinates corresponding to the four corners of the predicted text box (e.g., a bounding box), and where batch is the total number of predicted text boxes (e.g., total number of text queries). In yet further embodiments, the batch may be determined by the number of token queries corresponding to features (e.g., text, characters, etc.), such that a bounding box is predicted for each feature.
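
For illustration, a decoder output of size (batch, 8) could be reinterpreted as four corner coordinates per predicted text box as follows; the tensor shapes shown are assumptions.

import torch

def decode_text_boxes(decoder_output):
    # Reshape each 8-length vector (x1, y1, x2, y2, x3, y3, x4, y4) into
    # four (x, y) corner coordinates of a predicted text box.
    return decoder_output.view(-1, 4, 2)

predictions = torch.rand(16, 8)            # 16 predicted text boxes
corners = decode_text_boxes(predictions)   # shape (16, 4, 2)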

[0064] As in the encoder layers 220, in further examples, once the multi-head attention block 255 generates the respective output vectors, the output vectors of each of the attention heads may be concatenated to form a single output vector of the multi-head attention block 255. The concatenated output vector may be passed to addition and normalization layer 260, FFN 265, and addition and normalization layer 270. Thus, like the encoder layer 220a, the addition and normalization layer 260 may be configured to add the output vector to the original input vector (e.g., output vector from the transformer encoder 215, such as an encoded sequence of image tokens) to produce a residual output, which is then normalized. The normalized residual output may then be passed to the FFN 265. In some embodiments, the FFN 265 may be a fully-connected MLP, the output of which may again be passed to addition and normalization layer 270. In some examples, the addition and normalization layer 270 may add the output of the FFN 265 to the normalized residual output, which is then normalized again.

[0065] In some embodiments, the transformer decoder 245 may further be a deformable transformer decoder. In some examples, with a deformable transformer decoder 245, the deformable attention head may be configured to predict a bounding box as offsets relative to the reference point of the input features (e.g., image tokens). In some examples, the reference point may be a predicted box center for a text box. In some embodiments, the offsets may be predicted for a 3x3 matrix of points. In this way, a deformable transformer encoder 215 and/or deformable transformer decoder 245 allows the transformer to attend to sparse, more meaningful locations (e.g., feature locations and/or bounding box locations), rather than to global information.

[0066] Like the transformer encoder 215, in further embodiments, the transformer decoder 245 may include one or more decoder layers 250a-250n. In some embodiments, the output vector of the first decoder layer 250a may be provided as a further input to a subsequent decoder layer, and the outputs of subsequent decoder layers 250 fed to further decoder layers until the n-th decoder layer 250n. Accordingly, in some embodiments, the transformer decoder 245 may include 6 consecutive decoder layers 250, each with a respective multi-head attention block, addition and normalization layers, and FFN. The output of the transformer decoder 245, and specifically the n-th decoder layer 250n, may be the set of output predictions 275. As previously described, the set of output predictions may be the predicted text boxes of an input image. In some examples, the output may be a matrix of size (batch, 8).

[0067] Fig. 3 is a sequence diagram of a system 300 for transformer-based scene text detection, in accordance with various embodiments. The system 300 includes data pre-processing logic 305, feature extraction network 310, transformer encoder 315, and transformer decoder 320. It should be noted that the various components of the system 300 are schematically illustrated in Fig. 3, and that modifications to the various components and other arrangements of system 300 may be possible and in accordance with the various embodiments.

[0068] Here, the sequence diagram begins with the data pre-processing logic 305 receiving input data. As previously described, input data may include an input image, or a set of input images. In some examples, the input images may be part of a set of training data. Data pre-processing logic 305 may be configured to pre-process the data as previously described. For example, pre-processing may include one or more of flipping, rotating, scaling, and cropping of images. The pre-processed input data may then be provided to the feature extraction network 310.

[0069] The feature extraction network 310 may be configured to generate feature maps from the pre-processed input data. Accordingly, the feature extraction network 310 may include one or more of a CNN, ResNet, RNN, or other convolutional network configured to generate feature maps. In some embodiments, the feature extraction network 310 may be configured to generate tokens from the feature maps. In some examples, to generate image tokens, image tokens may be sampled from the feature maps. Each feature map may be flattened to be a vector with dimensions C x HW, and image tokens may be produced by sampling at a token sampling rate. After sampling from each feature map, the image tokens may be concatenated into a sequence of image tokens. As previously described, in some examples, the image tokens may alternatively be generated by a standalone component, or at the transformer encoder 315.

[0070] The image tokens may then be transmitted, from the feature extraction network 310, to the transformer encoder 315. In various embodiments, the transformer encoder 315 may include one or more encoder layers employing respective multi-headed attention blocks. In some examples, the multi-headed attention blocks may employ a plurality of attention heads to apply self-attention to respective Q, K, V vectors of the one or more image tokens.

[0071] Accordingly, in various embodiments, the transformer encoder 315 may be configured to first generate a set of Q, K, V vectors from the image tokens (e.g., Q, K, V metrics), as described above with respect to Figs. 1 & 2. In some examples, once the Q, K, V metrics are generated, the Q, K, V metrics may be divided into M-number of vectors corresponding to each attention head of the multi-headed attention block. Self-attention may then be applied to the Q, K, V metrics to generate an output vector (e.g., encoded image token vector). In some embodiments, the encoded image token vector may then be provided as an input vector to a subsequent encoder layer of the transformer encoder 315. In some examples, each encoded image token vector may correspond to an image feature, such as text or a character, and how it relates to all other image features in the input image. In some examples, at each encoder layer, the output vector may be passed through an addition and normalization layer, FFN, and further addition and normalization layer, before being fed to a subsequent layer.

[0072] In various embodiments, the output of the final encoder layer may then be transmitted as a set of token queries to the transformer decoder 320. The transformer decoder 320 may further be configured to accept text queries. The transformer decoder 320 may include one or more decoder layers, where each decoder layer employs respective multi-headed attention blocks. Accordingly, in various embodiments, the outputs of the transformer encoder 315 may be fed to the transformer decoder 320 as token queries representing image features. The token queries may be linearly transformed to generate query (Q) and key (K) queries to the multi-headed attention blocks of the decoder layers, and the text queries may be linearly transformed to generate one or more respective V queries. In various embodiments, text queries may be generated as a set of vectors (x1, y1, x2, y2), in which the scalar values of the vector are coordinates of a predicted text box generated with a Gaussian distribution.

[0073] In various embodiments, the transformer decoder 320 may then apply self-attention to the generated Q, K, V queries (vectors) to generate an output vector (e.g., a predicted text box). In some embodiments, the predicted text box may be provided as an input vector to a subsequent decoder layer. At each decoder layer, the output vector may further be passed through an addition and normalization layer, FFN, and further addition and normalization layer, before being fed to a subsequent layer. In some examples, the output of the transformer decoder 320 may be a set of predicted text boxes of an input image.

[0074] In various embodiments, during training, a one-to-one matching between the outputs of text queries with ground-truth text boxes may be enforced based on a cost-based matching algorithm. In some embodiments, a Hungarian matching algorithm may be employed to uniquely match predicted text boxes with ground-truth text boxes (e.g., bipartite matching). In further embodiments, bounding boxes may further be refined by optimizing for bounding box loss (e.g., intersection over union (IoU) loss).
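
As a sketch only, an IoU-based bounding box loss for matched, axis-aligned boxes given as (x1, y1, x2, y2) might be computed as below; the axis-aligned simplification and the epsilon term are assumptions for illustration.

import torch

def iou_loss(pred_boxes, gt_boxes, eps=1e-6):
    # pred_boxes, gt_boxes: (N, 4) matched pairs of (x1, y1, x2, y2) boxes.
    x1 = torch.max(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = torch.max(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = torch.min(pred_boxes[:, 2], gt_boxes[:, 2])
    y2 = torch.min(pred_boxes[:, 3], gt_boxes[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_pred = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    area_gt = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_pred + area_gt - inter + eps)
    return (1.0 - iou).mean()   # lower is better; 0 when the matched boxes coincide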

[0075] In various embodiments, optimizing for bounding box loss may include optimization of one or more parameters to minimize bounding box loss. For example, optimization of the parameters of the feature extraction network may include modification of the filters for feature extraction, and for the deformable transformer encoder, and/or decoder, modification of the various filters employed, such as in the FFN, linear layers, etc. Modification may also be made to the generation of image tokens and/or text queries, and the offsets / filters used for inputting to the deformable transformer encoder and/or decoder.

[0076] Fig. 4 is a flow diagram of a method for transformer-based scene text detection, in accordance with various embodiments. The method 400 begins, at block 405, by pre-processing input images. As described in the examples above, input data may include an input image, or a set of input images. In some examples, the input images may be part of a set of training data. In some embodiments, data pre-processing logic may be implemented to pre-process the input images. Pre-processing may include one or more of flipping, rotating, scaling, and cropping of images.

[0077] The method 400 may continue, at block 410, by generating one or more feature maps of an input image. As previously described, in some examples, a feature extraction network may be configured to generate one or more feature maps from an input image. In some examples, the input image may be a pre-processed input image. In some embodiments, the feature extraction network may be a CNN, ResNet, RNN, or other type of convolutional network. Thus, generating the one or more feature maps may include applying one or more convolutional filters to an input image. For example, in some embodiments, a filter, such as a text filter or character filter (e.g., a filter optimized to identify textual features), may be applied to generate a feature map of textual features. As previously described, in some examples, the feature extraction network may be a ResNet-50. Using a ResNet-50 backbone, a feature pyramid may be extracted for each image of the input data, which then generates 5 stages of feature maps C1, C2, C3, C4, and C5 for each image.

[0078] The method 400 may continue, at block 415, by generating image tokens from the feature maps. In some embodiments, generating the image tokens may include sampling the one or more feature maps at a token sampling rate. For example, each feature map may be flattened to be a vector with dimensions C x HW. Image tokens may be produced by sampling the vector at a token sampling rate. In some examples, the token sampling rate may be determined based on the size, shape, and/or orientation of a feature (e.g., a textual feature, such as a letter, character, word, etc.). After sampling from each feature map, the image tokens may be concatenated into a sequence of image tokens.

[0079] At block 420, the method 400 continues by providing the one or more image tokens to a transformer encoder. In various embodiments, the one or more image tokens (e.g., the sequence of image tokens) may be provided as an input vector to the transformer encoder. As previously described, the transformer encoder may include one or more encoder layers, each encoder layer employing a respective multi-headed attention block. The method may continue, at block 425, by transforming the input vector into Q, K, V metrics. For example, in various embodiments, the transformer encoder may be configured to generate Q, K, V metrics. Generating the Q, K, V metrics may further include applying a linear transform to the input vector (e.g., passing the input vector through a linear layer). Thus, each image token (which is a vector) of the sequence of image tokens may have respective linear transforms applied to generate Q, K, and V. For example, a first linear transform may be applied to the input vector to generate the Q metric (which is also a vector), a second linear transform may be applied to the input vector to generate the K metric, and a third linear transform may be applied to the input vector to generate the V metric.

[0080] At block 430, the method 400 continues by applying self-attention to the Q, K, V metrics. As previously described, applying self-attention may include multiplying V with the softmax of the downscaled dot product of Q and K. The dot product of Q and K may be downscaled by dividing by the dimension of K (d_k), or alternatively, as previously described, the square root of d_k. The self-attention operation is set forth in Eq. (2). After self-attention, in some examples, the resulting output vector may be normalized, through an addition and normalization layer, passed through an FFN, and normalized again through a subsequent addition and normalization layer. In various embodiments, each encoder layer of the transformer encoder may include a multi-head attention block, having an M-number of attention heads. Thus, to perform multi-headed self-attention, the Q, K, and V metrics may be divided into an M-number of component vectors, each attention head applying self-attention to a respectively divided Q, K, V component vector for the Q, K, V metrics. Moreover, in further examples, the self-attention process may be repeated at each encoder layer. In some embodiments, the output of the last encoder layer (e.g., the n-th encoder layer) may be an encoded sequence of image tokens. Each image token may be a vector corresponding to a respective image feature.

[0081] At block 435, the output of the transformer encoder may be provided to a transformer decoder as an input vector, referred to as a set of token queries. The decoder may further receive, as an input vector, a set of text queries. The transformer decoder may include one or more decoder layers, where each decoder layer employs respective multi-headed attention blocks. Accordingly, in various embodiments, the outputs of the transformer encoder may be fed to the transformer decoder as token queries representing image features, and text queries may further be fed to the transformer decoder, each text query representing the coordinates of a predicted text box.

[0082] At block 440, the method 400 may continue by generating Q, K, V queries from the input vectors (token queries and text queries). In some embodiments, the token queries may be linearly transformed to generate query (Q) and key (K) queries to the multi-headed attention blocks of the decoder layers, and the text queries may be linearly transformed to generate one or more respective V queries. In various embodiments, text queries may be generated as a set of vectors (x1, y1, x2, y2), in which the scalar values of the vector are coordinates of a predicted text box generated with a Gaussian distribution.

[0083] At block 445, the method 400 may continue by applying self-attention to the Q, K, V queries. In various embodiments, the transformer decoder may apply self-attention to the generated Q, K, V queries to generate an output vector (e.g., a predicted text box). In some embodiments, the predicted text box may be provided as an input vector to a subsequent decoder layer. At each decoder layer, the output vector may further be passed through an addition and normalization layer, FFN, and further addition and normalization layer, before being fed to a subsequent layer.

[0084] At block 450, the method 400 continues by generating one or more predicted text boxes. In some examples, the output of the transformer decoder may be a set of predicted text boxes of an input image. The output vector may, in some examples, be a matrix of size (batch, 8), as previously described. Thus, a batch number of vectors may be generated, each of the vectors being a set of 8 coordinates defining the coordinates of a predicted text box.

[0085] At block 455, the method 400 further includes determining an optimal matching between the one or more predicted text boxes and ground-truth text boxes for an input image. For example, in various embodiments, during training, a loss function may be determined to enforce one-to-one matching between the outputs of text queries with ground-truth text boxes. In some embodiments, a Hungarian matching algorithm may be employed to uniquely match predicted text boxes with ground-truth text boxes (e.g., bipartite matching).

[0086] At block 460, the method 400 may continue with optimization of bounding box losses (e.g., intersection over union (IoU) loss). In various embodiments, optimizing for bounding box loss may include optimization of one or more parameters to minimize bounding box loss. For example, optimization of the parameters of the feature extraction network may include modification of the filters for feature extraction, and for the deformable transformer encoder, and/or decoder, modification of the various filters employed, such as in the FFN, linear layers, etc. Modification may also be made to the generation of image tokens and/or text queries, and the offsets / filters used for inputting to the deformable transformer encoder and/or decoder.

[0087] The techniques and processes described above with respect to various embodiments may be performed by one or more computer systems. Fig. 5 is a schematic block diagram of a computer system for transformer-based scene text detection, in accordance with various embodiments. Fig. 5 provides a schematic illustration of one embodiment of a computer system 500, such as the systems 100, 200 or subsystems thereof, which may perform the methods provided by various other embodiments, as described herein. It should be noted that Fig. 5 only provides a generalized illustration of various components, of which one or more of each may be utilized as appropriate. Fig. 5, therefore, broadly illustrates how individual system elements may be implemented in a relatively separated or relatively more integrated manner.

[0088] The computer system 500 includes multiple hardware elements that may be electrically coupled via a bus 505 (or may otherwise be in communication, as appropriate). The hardware elements may include one or more processors 510, including, without limitation, one or more general-purpose processors and/or one or more special-purpose processors (such as microprocessors, digital signal processing chips, graphics acceleration processors, and microcontrollers); one or more input devices 515, which include, without limitation, a mouse, a keyboard, one or more sensors, and/or the like; and one or more output devices 520, which can include, without limitation, a display device, and/or the like.

[0089] The computer system 500 may further include (and/or be in communication with) one or more storage devices 525, which can comprise, without limitation, local and/or network accessible storage, and/or can include, without limitation, a disk drive, a drive array, an optical storage device, solid-state storage device such as a random-access memory ("RAM") and/or a read-only memory ("ROM"), which can be programmable, flash- updateable, and/or the like. Such storage devices may be configured to implement any appropriate data stores, including, without limitation, various file systems, database structures, and/or the like.

[0090] The computer system 500 might also include a communications subsystem 530, which may include, without limitation, a modem, a network card (wireless or wired), an IR communication device, a wireless communication device and/or chipset (such as a Bluetooth™ device, an 802.11 device, a WiFi device, a WiMax device, a WWAN device, a Z-Wave device, a ZigBee device, cellular communication facilities, etc.), and/or a low-power wireless device. The communications subsystem 530 may permit data to be exchanged with a network (such as the network described below, to name one example), with other computer or hardware systems, between data centers or different cloud platforms, and/or with any other devices described herein. In many embodiments, the computer system 500 further comprises a working memory 535, which can include a RAM or ROM device, as described above.

[0091] The computer system 500 also may comprise software elements, shown as being currently located within the working memory 535, including an operating system 540, device drivers, executable libraries, and/or other code, such as one or more application programs 545, which may comprise computer programs provided by various embodiments, and/or may be designed to implement methods, and/or configure systems, provided by other embodiments, as described herein. Merely by way of example, one or more procedures described with respect to the method(s) discussed above might be implemented as code and/or instructions executable by a computer (and/or a processor within a computer); in an aspect, then, such code and/or instructions can be used to configure and/or adapt a general purpose computer (or other device) to perform one or more operations in accordance with the described methods.

[0092] A set of these instructions and/or code might be encoded and/or stored on a non-transitory computer readable storage medium, such as the storage device(s) 525 described above. In some cases, the storage medium might be incorporated within a computer system, such as the system 500. In other embodiments, the storage medium might be separate from a computer system (i.e., a removable medium, such as a compact disc, etc.), and/or provided in an installation package, such that the storage medium can be used to program, configure, and/or adapt a general purpose computer with the instructions/code stored thereon. These instructions might take the form of executable code, which is executable by the computer system 500 and/or might take the form of source and/or installable code, which, upon compilation and/or installation on the computer system 500 (e.g., using any of a variety of generally available compilers, installation programs, compression/decompression utilities, etc.) then takes the form of executable code.

[0093] It will be apparent to those skilled in the art that substantial variations may be made in accordance with specific requirements. For example, customized hardware (such as programmable logic controllers, single board computers, FPGAs, ASICs, and SoCs) might also be used, and/or particular elements might be implemented in hardware, software (including portable software, such as applets, etc.), or both. Further, connection to other computing devices such as network input/output devices may be employed.

[0094] As mentioned above, in one aspect, some embodiments may employ a computer or hardware system (such as the computer system 500) to perform methods in accordance with various embodiments of the invention. According to a set of embodiments, some or all of the procedures of such methods are performed by the computer system 500 in response to processor 510 executing one or more sequences of one or more instructions (which might be incorporated into the operating system 540 and/or other code, such as an application program 545) contained in the working memory 535. Such instructions may be read into the working memory 535 from another computer readable medium, such as one or more of the storage device(s) 525. Merely by way of example, execution of the sequences of instructions contained in the working memory 535 might cause the processor(s) 510 to perform one or more procedures of the methods described herein.

[0095] The terms "machine readable medium" and "computer readable medium," as used herein, refer to any medium that participates in providing data that causes a machine to operate in a specific fashion. In an embodiment implemented using the computer system 500, various computer readable media might be involved in providing instructions/code to processor(s) 510 for execution and/or might be used to store and/or carry such instructions/code (e.g., as signals). In many implementations, a computer readable medium is a non-transitory, physical, and/or tangible storage medium. In some embodiments, a computer readable medium may take many forms, including, but not limited to, non-volatile media, volatile media, or the like. Non-volatile media includes, for example, optical and/or magnetic disks, such as the storage device(s) 525. Volatile media includes, without limitation, dynamic memory, such as the working memory 535. In some alternative embodiments, a computer readable medium may take the form of transmission media, which includes, without limitation, coaxial cables, copper wire and fiber optics, including the wires that comprise the bus 505, as well as the various components of the communication subsystem 530 (and/or the media by which the communications subsystem 530 provides communication with other devices). In an alternative set of embodiments, transmission media can also take the form of waves (including, without limitation, radio, acoustic, and/or light waves, such as those generated during radio-wave and infra-red data communications).

[0096] Common forms of physical and/or tangible computer readable media include, for example, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium, a CD-ROM, any other optical medium, punch cards, paper tape, any other physical medium with patterns of holes, a RAM, a PROM, an EPROM, a FLASH-EPROM, any other memory chip or cartridge, a carrier wave as described hereinafter, or any other medium from which a computer can read instructions and/or code.

[0097] Various forms of computer readable media may be involved in carrying one or more sequences of one or more instructions to the processor(s) 510 for execution. Merely by way of example, the instructions may initially be carried on a magnetic disk and/or optical disc of a remote computer. A remote computer might load the instructions into its dynamic memory and send the instructions as signals over a transmission medium to be received and/or executed by the computer system 500. These signals, which might be in the form of electromagnetic signals, acoustic signals, optical signals, and/or the like, are all examples of carrier waves on which instructions can be encoded, in accordance with various embodiments of the invention.

[0098] The communications subsystem 530 (and/or components thereof) generally receives the signals, and the bus 505 then might carry the signals (and/or the data, instructions, etc. carried by the signals) to the working memory 535, from which the processor(s) 510 retrieves and executes the instructions. The instructions received by the working memory 535 may optionally be stored on a storage device 525 either before or after execution by the processor(s) 510.

[0099] While some features and aspects have been described with respect to the embodiments, one skilled in the art will recognize that numerous modifications are possible. For example, the methods and processes described herein may be implemented using hardware components, software components, and/or any combination thereof. Further, while various methods and processes described herein may be described with respect to particular structural and/or functional components for ease of description, methods provided by various embodiments are not limited to any particular structural and/or functional architecture but instead can be implemented on any suitable hardware, firmware and/or software configuration. Similarly, while some functionality is ascribed to one or more system components, unless the context dictates otherwise, this functionality can be distributed among various other system components in accordance with the several embodiments.

[0100] Moreover, while the procedures of the methods and processes described herein are described in a particular order for ease of description, unless the context dictates otherwise, various procedures may be reordered, added, and/or omitted in accordance with various embodiments. Moreover, the procedures described with respect to one method or process may be incorporated within other described methods or processes; likewise, system components described according to a particular structural architecture and/or with respect to one system may be organized in alternative structural architectures and/or incorporated within other described systems. Hence, while various embodiments are described with or without some features for ease of description and to illustrate aspects of those embodiments, the various components and/or features described herein with respect to a particular embodiment can be substituted, added and/or subtracted from among other described embodiments, unless the context dictates otherwise. Consequently, although several embodiments are described above, it will be appreciated that the invention is intended to cover all modifications and equivalents within the scope of the following claims.