XU YIHENG (US)
XU YANG (US)
WEI FURU (US)
WANG ZILONG (US)
US20210081729A1 (2021-03-18)
TOMASZ STANISLAWEK ET AL: "Kleister: Key Information Extraction Datasets Involving Long Documents with Complex Layouts", ARXIV.ORG, 12 May 2021 (2021-05-12), Ithaca, XP055958428, Retrieved from the Internet
ZILONG WANG ET AL: "LayoutReader: Pre-training of Text and Layout for Reading Order Detection", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 26 August 2021 (2021-08-26), XP091040533
CLAIMS

1. A computer-implemented method comprising: determining a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations.

2. The method of claim 1, wherein generating the plurality of semantic feature representations comprises: for a first text element in the text sequence, determining an attention weight for the first text element with respect to a second text element in the text sequence based on at least one of the following: a relative spatial positioning of the first text element with respect to the second text element in the document, and a relative ranking position of the first text element with respect to the second text element in the text sequence, the attention weight indicating an importance degree of the second text element to the first text element; and determining a semantic feature representation of the first text element by weighting an embedding representation of the second text element with the determined attention weight.

3. The method of claim 1, wherein generating the plurality of semantic feature representations comprises: determining an image-format file corresponding to the document; determining visual information from the image-format file, the visual information indicating visual appearances of the plurality of text elements presented in the document; and generating the plurality of semantic feature representations further based on the visual information.

4. The method of claim 1, wherein generating the plurality of semantic feature representations comprises: converting the text sequence and the layout information into a first embedding representation and a second embedding representation, respectively; concatenating the first embedding representation and the second embedding representation, to obtain a concatenated embedding representation; and applying the concatenated embedding representation into a trained feature extraction model to generate the plurality of semantic feature representations.

5. A computer-implemented method comprising: determining a text sequence, layout information and order labeling information presented in a first sample document, the text sequence comprising a first set of text elements, the layout information indicating a spatial layout of the first set of text elements in the first sample document, the order labeling information indicating a ground-truth reading order of the first set of text elements in the first sample document; generating, using a feature extraction model, respective semantic feature representations of the first set of text elements based at least on the text sequence and the layout information; determining, using an order determination model, a predicted reading order of the first set of text elements in the first sample document based on the semantic feature representations; and training the feature extraction model and order determination model based on a difference between the predicted reading order and the ground-truth reading order.

6. The method of claim 5, wherein the first sample document comprises an editable text document, and determining the order labeling information comprises: determining format information corresponding to the editable text document, the format information at least indicating the ground-truth reading order of the first set of text elements.

7. The method of claim 5, wherein determining the layout information comprises: determining a vector file corresponding to the first sample document; and determining the layout information of the first set of text elements from the vector file.

8. The method of claim 7, wherein a plurality of text elements that occur at different positions in the first sample document and represent a same text are assigned with different indices, and the plurality of text elements are labeled with different colors in the vector file, the color with which each text element is labeled being determined based on the index assigned to the text element; and wherein determining the layout information of the first set of text elements from the vector file comprises: assigning, based on the indices and the colors assigned to the plurality of text elements, layout information determined from the vector file to the plurality of text elements extracted from the first sample document.

9. The method of claim 5, wherein generating the semantic feature representations comprises: determining visual information from a first image-format file corresponding to the first sample document, the visual information representing visual appearances of the first set of text elements presented in the first sample document; and generating, using the feature extraction model, the semantic feature representations further based on the visual information.

10. The method of claim 5, further comprising obtaining the pre-trained feature extraction model by: determining a second image-format file corresponding to a second sample document, the second sample document comprising a second set of text elements; generating, by masking at least one text element of the second set of text elements in the second image-format file, first masking information to indicate that the at least one text element is masked and other text elements of the second set of text elements are not masked; determining, using the feature extraction model, respective semantic feature representations of the second set of text elements; determining second masking information based on the respective semantic feature representations of the second set of text elements, the second masking information indicating whether respective text elements of the second set of text elements are masked; and pre-training the feature extraction model based on a difference between the first masking information and the second masking information.

11. The method of claim 9, wherein the feature extraction model is further configured to generate a first visual feature representation of the first image-format file based on the text sequence, the layout information and the visual information, wherein the method further comprises obtaining the pre-trained feature extraction model by: determining a third sample document, a third image-format file, and match labeling information, the match labeling information indicating whether the third image-format file matches with the third sample document; generating, using the feature extraction model, a second visual feature representation of the third image-format file based on the third sample document and the third image-format file; determining, based on the second visual feature representation, a match result indicating whether the third image-format file matches with the third sample document; and pre-training the feature extraction model based on a difference between the match result and the match labeling information.

12. An electronic device, comprising: a processor; and a memory coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the device to perform acts comprising: determining a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations.

13. The device of claim 12, wherein generating the plurality of semantic feature representations comprises: for a first text element in the text sequence, determining an attention weight for the first text element with respect to a second text element in the text sequence based on at least one of the following: a relative spatial positioning of the first text element with respect to the second text element in the document, and a relative ranking position of the first text element with respect to the second text element in the text sequence, the attention weight indicating an importance degree of the second text element to the first text element; and determining a semantic feature representation of the first text element by weighting an embedding representation of the second text element with the determined attention weight.

14. An electronic device, comprising: a processor; and a memory coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the device to perform acts comprising: determining a text sequence, layout information and order labeling information presented in a first sample document, the text sequence comprising a first set of text elements, the layout information indicating a spatial layout of the first set of text elements in the first sample document, the order labeling information indicating a ground-truth reading order of the first set of text elements in the first sample document; generating, using a feature extraction model, respective semantic feature representations of the first set of text elements based at least on the text sequence and the layout information; determining, using an order determination model, a predicted reading order of the first set of text elements in the first sample document based on the semantic feature representations; and training the feature extraction model and order determination model based on a difference between the predicted reading order and the ground-truth reading order.

15. The device of claim 14, wherein the first sample document comprises an editable text document, and determining the order labeling information comprises: determining format information corresponding to the editable text document, the format information at least indicating the ground-truth reading order of the first set of text elements.
In some implementations, the text elements in the text sequence 312 may also have respective sequence index information, which indicates their sequential positions in the text sequence 312. Different from the two-dimensional relative spatial positions in the document 162 indicated by the layout information, the sequence index information indicates a relative position of a text element in the one-dimensional text sequence 312, and may therefore also be regarded as one-dimensional position information. Corresponding sequence index information may be assigned to each text element in order, starting from a starting text element of the text sequence 312. In some implementations, if the input sequence S shown in the above Equation (1) is constructed, each text element or symbol in the sequence may have corresponding sequence index information. Therefore, for the text sequence 312, the input embedding representation that may be constructed accordingly includes the embedding representation 330 of the text element, the embedding representation 332 of the segment identification information and the embedding representation 333 of the sequence index information. For each text element, the constructed embedding representation may be represented as follows:

t_i = TokEmb(w_i) + PosEmb1D(i) + SegEmb(s_i)   (2)

where TokEmb(w_i) represents the embedding representation of the i-th text element w_i, PosEmb1D(i) represents the embedding representation of the sequence index information of the text element w_i, and SegEmb(s_i) represents the embedding representation of the segment identification information of the text element w_i. It should be appreciated that the above are only examples. In other implementations, the embedding representations of the sequence index information and/or the segment identification information may be omitted.
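As an illustrative sketch of Equation (2) only (the lookup tables, dimensions and names below are assumptions, not the patent's implementation), three per-position embeddings are looked up and summed:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                       # embedding dimension (illustrative)
vocab, max_len, n_seg = 50, 16, 2

# Hypothetical lookup tables; a real model would learn these.
tok_emb = rng.normal(size=(vocab, d))
pos_emb_1d = rng.normal(size=(max_len, d))  # sequence index information
seg_emb = rng.normal(size=(n_seg, d))       # segment identification information

def text_embedding(token_ids, segment_ids):
    """t_i = TokEmb(w_i) + PosEmb1D(i) + SegEmb(s_i), per Equation (2)."""
    idx = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb_1d[idx] + seg_emb[segment_ids]

emb = text_embedding([3, 7, 7, 12], [0, 0, 0, 0])
print(emb.shape)  # (4, 8)
```

Note that the two occurrences of token 7 receive different embeddings because the sequence index term differs.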
In some implementations, in addition to the layout information, to enable the feature extraction model 120 to extract more accurate semantic feature representations for determining the reading order of text elements, visual information related to the document 162 may also be determined and provided as an input of the feature extraction model 120. In some implementations, the visual information may indicate visual appearances of respective text elements presented in the document 162. The visual information may also be characterized as a feature representation in vector form, as an input of the feature extraction model 120. In some examples, a feature map extractor 320 may be used to extract the embedding representation 334 of the visual information. In some implementations, if the document 162 is an image-format file, it may be directly used as the input of the feature map extractor 320. If the document 162 is not an image-format file, the document 162 may be converted into an image-format file as the input of the feature map extractor 320. The feature map extractor 320 may include a machine learning model adapted to process images, for example a convolutional neural network (CNN), to extract one or more feature maps from the image-format file to represent the visual information. In some implementations, after the feature map is extracted by the feature map extractor 320, the feature map may be pooled, e.g., subjected to an average pooling operation, to obtain a feature map of predetermined dimensions (e.g., width W and height H). Since the feature map is in a two-dimensional form, the feature map may be flattened into a one-dimensional visual embedding representation sequence, e.g., a visual embedding representation sequence with a total length of W*H, to provide the embedding representation 334 of the visual information.
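The pooling-and-flattening step can be sketched with numpy as follows (the feature map values and all sizes are toy assumptions standing in for CNN output):

```python
import numpy as np

# Toy "feature map" standing in for CNN output over the page image;
# channel dimension C, spatial dimensions larger than the W x H target.
C, W, H = 4, 2, 2
fmap = np.arange(C * 4 * 4, dtype=float).reshape(C, 4, 4)

def pool_and_flatten(fmap, W, H):
    """Average-pool the feature map to a fixed W x H grid, then flatten
    it row by row into a sequence of W*H visual embedding representations."""
    C, h, w = fmap.shape
    pooled = fmap.reshape(C, H, h // H, W, w // W).mean(axis=(2, 4))  # (C, H, W)
    return pooled.reshape(C, H * W).T                                  # (W*H, C)

seq = pool_and_flatten(fmap, W, H)
print(seq.shape)  # (4, 4): W*H visual embeddings, each of dimension C
```

Each row of `seq` averages one spatial block of the map, so the first visual embedding summarizes the top-left region of the page.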
For example, the sequence may include a plurality of visual embedding representations, and each visual embedding representation may correspond to the feature information of a portion of the feature map. Since the two-dimensional space of the feature map corresponds to the two-dimensional space of the document 162, the feature information of this portion may also correspond to the feature information in the document 162. In some implementations, the visual embedding representation may also be processed through a linear mapping layer to unify the dimensions. For better understanding, Fig. 4 illustrates examples of various types of embedding representations. As shown in Fig. 4, after the feature map extractor 340 extracts the feature map 412 from the image-format file 402 corresponding to the document 162, the feature map 412 is flattened into four visual embedding representations, which may respectively correspond to four portions V1 to V4 in the image-format file 402 to characterize visual features of these four portions. The four visual embedding representations may form a one-dimensional visual embedding representation sequence to serve as the embedding representation 334 of the visual information. In some implementations, similar to the text elements, sequence index information and segment identification information of the visual embedding representations may also be considered. Correspondingly, for the i-th visual embedding representation 334, the constructed embedding representation includes the following content:

v_i = Proj(VisEmb_i) + PosEmb1D(i) + SegEmb([C])   (3)

where VisEmb_i represents the i-th embedding representation 334 of the visual information obtained from flattening the feature map, and Proj represents the linear mapping layer. PosEmb1D(i) represents the embedding representation corresponding to the sequence index information of the i-th visual embedding representation. If the feature map is divided into M visual embedding representations, the value of i may be in a range of 1 to M. SegEmb([C]) represents the embedding representation of the segment identification information. Each visual embedding representation may be assigned to a segment of a visual type, such as [C], the segment type being different from the segment type of the text elements. In some implementations, for each visual embedding representation, the embedding representation of the layout information may also be determined. The embedding representation of the layout information may be represented by relative spatial positions and sizes of each visual embedding representation in the corresponding portions (the portions V1 to V4 in Fig. 4) of the document 162 or image-format file 402, in a representation manner similar to the embedding representation of the layout information of the text elements. In some examples, in a case where the layout information of the text elements and the visual embedding representations is considered at the same time, the embedding representation 331 of the layout information may be represented as follows:

l_i = Concat(PosEmb2D_x(x0, x1, w), PosEmb2D_y(y0, y1, h))   (4)

where W and H represent a total width and a total height of the document 162, and L represents a length of the input sequence S corresponding to the text elements. In the above Equation (4), the x-axis coordinate information (x0, x1) and width w are used as a triple to construct an embedding representation, the y-axis coordinate information (y0, y1) and height h are used as a triple to construct another embedding representation, and the two embedding representations are then concatenated into the embedding representation of the layout information of the i-th text element. It should be appreciated that the layout information of a text element may also be converted into the corresponding embedding representation in other ways, as long as the determined embedding representation can distinguish different layout information.
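Equation (4) can be sketched as follows; the per-coordinate lookup tables and the summation inside each triple are assumptions for illustration, since the patent only fixes the concatenation of an x-triple embedding and a y-triple embedding:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4            # half of the layout embedding dimension (illustrative)
grid = 100       # coordinates assumed normalized to a 0..99 grid

# Hypothetical lookup tables for coordinate and size embeddings.
x_emb = rng.normal(size=(grid, d))
y_emb = rng.normal(size=(grid, d))
w_emb = rng.normal(size=(grid, d))
h_emb = rng.normal(size=(grid, d))

def layout_embedding(box):
    """l_i = Concat(PosEmb2D_x(x0, x1, w), PosEmb2D_y(y0, y1, h)),
    per Equation (4); each triple is embedded here by summing
    per-coordinate lookups (an assumed parameterization)."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    ex = x_emb[x0] + x_emb[x1] + w_emb[w]   # x-axis triple (x0, x1, w)
    ey = y_emb[y0] + y_emb[y1] + h_emb[h]   # y-axis triple (y0, y1, h)
    return np.concatenate([ex, ey])

l = layout_embedding((10, 20, 30, 25))
print(l.shape)  # (8,)
```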
In some implementations, if the input sequence S is constructed by adding predetermined symbols such as [CLS], [SEP] and [PAD], the embedding representation of the layout information corresponding to these predetermined symbols may be determined as an all-zero vector, for example (0, 0, 0, 0, 0, 0). In some implementations, considering the above-mentioned types of embedding representations, the embedding representation (determined by the above Equation (2)) constructed for the text elements and the embedding representation (determined by the above Equation (3)) constructed for the visual information may be concatenated into a unified sequence X, and then the sequence X is combined with the embedding representation 331 of the layout information, thereby obtaining an input embedding representation of the feature extraction model 120. The determination of the input embedding representation may be represented as follows:

X = {v_1, ..., v_M, t_1, ..., t_L},   x_i = X_i + l_i

For better understanding, Fig. 4 illustrates how these different types of embedding information are combined into the input embedding representation of the feature extraction model 120. As shown in Fig. 4, the visual embedding representation 334 (represented by V1 to V4) and the embedding representation 330 of the text element (represented by [CLS], T1 to T6, and [SEP]) are concatenated. In addition, for each text element and each visual embedding representation, there exist a corresponding embedding representation 331 of the layout information and a corresponding embedding representation 333 of the sequence index information. These embedding representations may be summed up per text element and per visual embedding representation, to form a final input embedding representation. The above shows an example of the input when the feature extraction model 120 performs feature extraction.
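Putting Equations (2) to (4) together, the input embedding construction can be sketched as follows (all values are random stand-ins for the learned embeddings):

```python
import numpy as np

d, M, L = 8, 4, 7
rng = np.random.default_rng(2)
v = rng.normal(size=(M, d))        # visual embeddings (Equation (3))
t = rng.normal(size=(L, d))        # text-element embeddings (Equation (2))
lay = rng.normal(size=(M + L, d))  # layout embeddings (Equation (4)), one per position

# Concatenate into one sequence X, then add the layout embedding per position.
X = np.concatenate([v, t], axis=0)
x_in = X + lay                     # input embedding of the feature extraction model

print(x_in.shape)  # (11, 8)
```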
It should be appreciated that in different implementations, one or more of the above-mentioned visual information, sequence index information and segment identification information may be omitted. The feature extraction model 120 is configured to determine the semantic feature representations corresponding to respective text elements in the text sequence 312 based on the input embedding representations. For example, for the text elements w1 to w7 shown in Fig. 2, the semantic feature representations h1 to h7 of the respective text elements may be determined. In some implementations, the feature extraction model 120 may determine the semantic feature representations based on a self-attention mechanism. According to the self-attention mechanism, an importance degree of each text element to other text elements may be determined. In this way, when the semantic feature representation of a certain text element is extracted, more attention is paid to important text elements, and less attention is paid to unimportant text elements. In some implementations, an attention weight of the text element i with respect to the text element j in the text sequence 310 may be determined based on the embedding representations of the two text elements. In some implementations, the subject matter described herein further proposes a spatial-aware self-attention mechanism, which determines the attention weight based on the spatial position relationship of the text elements. Specifically, the attention weight of one text element with respect to another text element may be determined based on the relative spatial positions of the two text elements in the document 162. These relative spatial positions are based on the spatial positions (x0, y0) and (x1, y1) discussed above regarding the layout information. In some implementations, upon determining the attention weight, the coordinates of a point in the bounding box may be considered, for example (x0, y0) or (x1, y1).
The relative spatial positions here may characterize the relative positioning relationship of two text elements in the two-dimensional space of the document 162. Alternatively or additionally, the positioning relationship may also be represented based on relative ranking positions of the two text elements in the text sequence 310, e.g., represented by the sequence ranking indices of the two text elements. The relative ranking positions may characterize the relative positioning relationship of the two text elements in the one-dimensional space of the text sequence 310. With the position information taken into consideration, the relative spatial positions and relative ranking positions of the text element i and the text element j may be used to determine an offset for adjusting a basic attention weight determined based on the embedding representations of the two text elements i and j. A process of determining the offset from the relative spatial positions and the relative ranking positions may be a training process, so as to learn corresponding parameter values for determining an accurate offset for different relative spatial positions and relative ranking positions. The attention weight of the text element i with respect to the text element j may be determined as follows:

a'_ij = a_ij + b^(1D)_(j-i) + b^(2Dx)_(xj-xi) + b^(2Dy)_(yj-yi)   (5)

where a_ij represents the basic attention weight determined based on the embedding representations of the two text elements i and j, b^(2Dx)_(xj-xi) represents the offset determined based on the coordinate information of the x-axis in the relative spatial positions of the text element i and the text element j, b^(2Dy)_(yj-yi) represents the offset determined based on the coordinate information of the y-axis in the relative spatial positions of the text element i and the text element j, and b^(1D)_(j-i) represents the offset determined based on the relative ranking positions. For a given text element in the text sequence 310, the attention weight of the given text element with respect to each other text element may be determined.
The attention weight may be used to weight the embedding representation of the other text element to determine the semantic feature representation of the given text element. This may be represented as follows:

h_i = sum_j ( exp(a'_ij) / sum_k exp(a'_ik) ) x_j W^V   (6)

where h_i represents the semantic feature representation of the text element i, x_j represents the embedding representation of the j-th text element, and W^V represents a parameter value of the model. In some implementations, the feature extraction model 120 may include a plurality of network layers, such as Transformer network layers. An attention head in each network layer or each Transformer network layer may use the self-attention mechanism to process the input of the network layer and produce an output of the network layer. At each network layer, the attention weight may be determined based on the above Equations (5) and (6). In such an implementation, the basic attention weight a_ij is determined based on the input embedding representations of the text element i and the text element j (for the first network layer) or on the intermediate feature representations output by the previous network layer. In addition, in the above Equation (6), x_j represents the input embedding representation of the j-th text element (for the first network layer) or the intermediate feature representation output by the previous network layer. The semantic feature representations of the text elements (for example, h1 to h7 shown in Fig. 3) may be provided to the order determination model 150 for determining the reading order of the plurality of text elements in the document 162. In some implementations, a reading order index of each text element may be determined. For example, if there are a total of N text elements, the reading order index runs from 1 to N to indicate the reading order of these text elements. In some implementations, the feature extraction model 120 and the order determination model 150 may be based on a sequence-to-sequence (seq2seq) model architecture.
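A toy numpy sketch of the spatial-aware self-attention of Equations (5) and (6); the bucketing of relative distances, the table sizes and the coordinate values are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 5, 8
x = rng.normal(size=(n, d))          # input embeddings of n text elements
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Hypothetical learned offset tables, indexed by clipped relative distance.
def bucket(delta, n_buckets=7):
    return int(np.clip(delta, -(n_buckets // 2), n_buckets // 2)) + n_buckets // 2

b1d = rng.normal(size=7)   # offsets for relative ranking positions
b2dx = rng.normal(size=7)  # offsets for relative x-axis positions
b2dy = rng.normal(size=7)  # offsets for relative y-axis positions

xs = [0, 1, 2, 0, 1]       # x0 coordinates of the 5 elements (toy values)
ys = [0, 0, 0, 1, 1]       # y0 coordinates

q, k, v = x @ Wq, x @ Wk, x @ Wv
alpha = q @ k.T / np.sqrt(d)                       # basic attention weights a_ij
for i in range(n):
    for j in range(n):
        # Equation (5): add 1D-ranking, x-axis and y-axis offsets.
        alpha[i, j] += (b1d[bucket(j - i)]
                        + b2dx[bucket(xs[j] - xs[i])]
                        + b2dy[bucket(ys[j] - ys[i])])

# Equation (6): softmax over j, then weight the value vectors.
p = np.exp(alpha - alpha.max(axis=1, keepdims=True))
p /= p.sum(axis=1, keepdims=True)
h = p @ v                                          # semantic feature representations
print(h.shape)  # (5, 8)
```

Two elements with the same embedding but different positions now receive different attention weights, which is the point of the spatial-aware variant.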
The order determination model 150 may sequentially determine the reading order index of each text element one by one, starting from an initial text element of the text sequence 310. In some implementations, the feature extraction model 120 may be based on a Transformer network configuration, a BERT network configuration, or any other suitable machine learning or neural network configuration. The embodiments of the present disclosure are not limited in this respect. The step-by-step prediction may be represented as follows:

P(w_k = i | w_1, ..., w_(k-1)) = exp(e_i h_k + b_k) / sum_j exp(e_j h_k + b_k)   (8)

In the above Equation (8), w_k represents the k-th text element in the text sequence 312, and i represents the reading order index. For example, if there are a total of N text elements, the positions of the reading order may be indexed from 1 to N. In the above Equation (8), P(w_k = i | w_1, ..., w_(k-1)) represents a probability that the text element w_k has the reading order index i in the reading order in a case where the reading order indices of the text elements before the text element w_k in the given text sequence 312 are known. In addition, in the above Equation (8), e_i and e_j represent the embedding representations of the i-th text element and the j-th text element, h_k represents the semantic feature representation output from the feature extraction model 120 in the k-th step, and b_k represents an offset value at the k-th step. The reading order detection model 140 sequentially makes determinations starting from the first text element of the text sequence one by one, and in each step judges the position of a text element in the reading order. The reading order of each text element in the document may be determined through the reading order detection model. The information about the reading order of the text elements may also be used to guide further document understanding tasks for the document. In some implementations, the reading order of regions in the document 162 may also be determined based on the reading order indices of the text elements.
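The step-by-step decoding and the region-level ordering can be sketched as follows; the decoder states are random stand-ins, the offset b_k is omitted, and the region assignment is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)
N, d = 6, 8
e = rng.normal(size=(N, d))   # embedding representations of the N text elements
h = rng.normal(size=(N, d))   # per-step decoder states (toy stand-ins)

def decode_reading_order(e, h):
    """Greedy step-by-step decoding: at step k, choose the not-yet-placed
    text element with the highest probability under Equation (8)."""
    order = []
    for k in range(len(e)):
        logits = e @ h[k]                  # e_i . h_k for every element i
        for i in order:                    # mask elements already placed
            logits[i] = -np.inf
        order.append(int(logits.argmax()))
    return order  # order[k] = index of the element read at position k+1

order = decode_reading_order(e, h)

# Region-level ordering: rank each grid region by the smallest reading
# order index among the text elements it contains (hypothetical regions).
regions = {"header": [0, 1], "body": [2, 3], "footer": [4, 5]}
rank = {elem: pos for pos, elem in enumerate(order)}
region_order = sorted(regions, key=lambda r: min(rank[t] for t in regions[r]))
print(sorted(order))  # [0, 1, 2, 3, 4, 5]
```

Masking already-placed elements guarantees each text element receives exactly one reading order position.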
Fig. 5 illustrates an example of labeling a text reading order in the document 162 in accordance with some embodiments of the present disclosure. As shown in Fig. 5, it is desirable to mark the reading order of the respective regions, for example, the reading order of respective grid regions in the document 162. The text elements presented in each grid region may be determined. The smallest reading order index among all reading order indices of the text elements in each grid region is determined as the order index of the grid region. Then, the order indices of the plurality of grid regions in the document 162 may be ranked, and the reading order index of each grid region may be labeled in an ascending order. Fig. 5 shows the reading order index of each grid cell in numerical form. In some implementations, the labeling of the reading order in the document may also be presented to the user, or may be provided for other analysis tasks of the document. The embodiments of the present disclosure are not limited here.

Implementation of the training of the reading order detection model

In order to train the reading order detection model 140 including the feature extraction model 120 and the order determination model 150, training data including the sample document set 122 and the order labeling information 124 shown in Fig. 1 needs to be obtained. In the training process, the parameter values of the feature extraction model 120 and the order determination model 150 may be initialized. In some implementations, the feature extraction model 120 may undergo a pre-training process, and thus have pre-trained parameter values as initial values. Fig. 6 illustrates an example architecture for training the reading order detection model in accordance with some embodiments of the present disclosure, which may be implemented in the task-specific training device 130. In the training process, sample documents in the sample document set 122 are provided as input.
The feature extraction model 120 may extract a text sequence from the sample documents according to the processing procedure described above, determine the layout information, and determine the semantic feature representations of the text elements in the sample documents based on the text sequence and the layout information. At this time, the feature extraction model 120 extracts the semantic feature representations by using its current parameter values. Similarly, the order determination model 150, using its own current parameter values, determines a predicted reading order of the text elements in the sample document based on the semantic feature representations extracted by the feature extraction model 120. The task-specific training device 130 includes a parameter update module 610, which is configured to determine a ground-truth reading order of the text elements in the currently-input sample document from the order labeling information 124, and to determine how or whether to update the parameter values of the feature extraction model 120 and the order determination model 150 based on a difference between the predicted reading order and the ground-truth reading order. A purpose of updating the parameter values of the feature extraction model 120 and the order determination model 150 is to reduce the difference between the predicted reading order and the ground-truth reading order. The updated parameter values determined by the parameter update module 610 are applied to the feature extraction model 120 and the order determination model 150 for use in the next iteration. The parameter values of the models may be continuously determined in this iterative manner, so that the models can learn how to correctly detect the reading order of text elements and achieve a convergence target.
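The training signal, i.e. the difference between the predicted and ground-truth reading order, can be sketched as a per-step cross-entropy; the loss form and the toy probabilities below are assumptions consistent with sequence prediction, not the patent's stated objective:

```python
import numpy as np

def order_loss(step_probs, gold_order):
    """Average cross-entropy between the predicted per-step distributions
    over text elements and the ground-truth reading order."""
    return -sum(np.log(step_probs[k][gold] + 1e-12)
                for k, gold in enumerate(gold_order)) / len(gold_order)

# Toy example: 3 text elements, ground truth says element 2 is read first,
# then element 0, then element 1.
step_probs = np.array([
    [0.1, 0.2, 0.7],   # step 1: mostly predicts element 2 (correct)
    [0.8, 0.1, 0.1],   # step 2: mostly predicts element 0 (correct)
    [0.2, 0.7, 0.1],   # step 3: mostly predicts element 1 (correct)
])
gold_order = [2, 0, 1]
loss = order_loss(step_probs, gold_order)
print(round(loss, 3))  # 0.312
```

The parameter update module would backpropagate this loss through both the order determination model and the feature extraction model.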
When training for the reading order detection task, the order labeling information about the text elements in the documents needs to be used as supervision information to guide the training of the model. Considering that the cost required for manual labeling of the order of the text elements in a large number of sample documents is too high, the embodiments of the subject matter described herein further propose a simple and efficient way to obtain the order labeling information. In some implementations, in the stage of collecting training data, editable text documents may be collected as sample documents, and such editable text documents may have corresponding format information indicating the ground-truth reading order of the text elements in the document. For example, for a Word document in the ".docx" format, the Office XML (Extensible Markup Language) code may be read to obtain the ground-truth reading order of the text elements presented on each page of the document. Other document formats from which information about the reading order can be extracted may also be applied. Since the input of the feature extraction model 120 still requires the layout information and possibly visual information to be determined, the editable text document may be converted into a format suitable for extracting the layout information or visual information. In some implementations, in order to extract the layout information, a Word document in the ".docx" format may be converted into a vector file, and a relative spatial position and size of each text element may be located from the vector file. The vector file is usually in a format with a fixed content layout, so that the layout information can be extracted more reliably. Examples of the vector file include a PDF (Portable Document Format) file and an SVG (Scalable Vector Graphics) file.
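As a minimal sketch of reading the Office XML: a ".docx" file is a ZIP archive whose word/document.xml stores text runs (w:t elements) in reading order, so the ground-truth order can be recovered with the standard library alone (error handling and page segmentation omitted):

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def reading_order_text(docx_bytes):
    """Return the text runs of a .docx document in their stored
    (reading) order, by walking word/document.xml in document order."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as zf:
        xml = zf.read("word/document.xml")
    root = ET.fromstring(xml)
    # Document order of <w:t> elements is the reading order of the text.
    return [t.text for t in root.iter(f"{W_NS}t") if t.text]
```

A page-level split would additionally require interpreting rendered page breaks, which the plain XML does not store directly.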
For each page of a sample document, the input information may be extracted, which includes the text elements and the layout information of the text elements, for example, the relative spatial position represented by the vertex coordinates of the bounding box, and possibly the size represented by the width and height of the bounding box. In some implementations, in order to extract visual information, a Word document in the ".docx" format may be converted into an image-format file as the input of the feature map extractor 340. In some implementations, the same text may occur many times in a document. For example, a document with English text may include words such as "the" and "a" that occur at different positions. The same texts occurring at different positions are usually extracted as different text elements in the text sequence. Various format conversions of the sample document need to be performed to prepare the text sequence, the layout information and the visual information. In this process, different text elements need to be accurately distinguished so that the layout information, visual information, etc. can be aligned with the text elements. In order to be able to distinguish text elements representing the same text, an index of occurrence order may be assigned to each text element in the text sequence. For example, if the text sequence extracted from the sample document is [the, car, hits, the, bus], then the indices indicating the occurrence order may be determined as [0, 0, 0, 1, 0], where for the text element "the" occurring twice, the text element "the" at the first occurrence has an index 0 to indicate the first occurrence of "the", and the text element "the" at the second occurrence has an index 1 to indicate the second occurrence of "the". Further, in the vector file corresponding to the sample document, a plurality of text elements representing the same text are labeled with different colors.
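The occurrence-index assignment above can be sketched in a few lines (the function name is illustrative):

```python
from collections import defaultdict

def occurrence_indices(tokens):
    """Assign each token an index counting prior occurrences of the same text,
    so repeated words become distinguishable text elements."""
    seen = defaultdict(int)
    out = []
    for tok in tokens:
        out.append(seen[tok])
        seen[tok] += 1
    return out

# The example from the text:
occurrence_indices(["the", "car", "hits", "the", "bus"])  # → [0, 0, 0, 1, 0]
```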
In this way, when the layout information (for example, the relative spatial position and/or size of a text element) is determined from the vector file, it can be determined which text element in the sample document the layout information corresponds to. In some implementations, an automated tool may be used to change the colors of the text elements in the sample document in an editable format, and then the colored document may be converted into the vector file. In some implementations, the color of each text element may be determined based on the index of the occurrence order of the text element. For example, all text elements in the sample document may first be changed to the same color, such as black. For a given text element, its color (taking the RGB color space as an example) may be determined as follows: r = Φ((i >> 16) & 255), g = Φ((i >> 8) & 255), b = Φ(i & 255), where i represents the index of the occurrence order of the text element, "&" represents a bitwise "and" operation, r, g and b represent color values determined for the three color channels of RGB, and Φ represents a mapping function for mapping the determined color values to the corresponding RGB color channels. Through the above index assignment and color labeling, for each sample document, the pair of the editable sample document and the corresponding vector file may be determined, and a mapping may be established for each text element, as shown in Fig. 7. A text sequence including a plurality of text elements is extracted from the editable text document 710, and in the corresponding vector file 720, the plurality of text elements representing the same text are labeled with different colors. As such, the mapping for each text element may be represented as: (w, i) ↔ (w′, c, x0, y0, x1, y1, W, H) (9). In the above formula (9), (w, i) is the information that may be extracted from the editable text document 710, including the text w represented by the text element (for example, the word itself), and i represents the index of the occurrence order of the text w in the text sequence.
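One plausible instantiation of this index-to-color scheme (an assumption for illustration: the shifts and an identity mapping function; the source's exact formula may differ) packs the occurrence index into the 24-bit RGB value, which makes the index exactly recoverable from the color:

```python
def index_to_rgb(i):
    """Pack occurrence index i into an RGB triple via shifts and bitwise AND
    (identity mapping function assumed for each channel)."""
    r = (i >> 16) & 0xFF
    g = (i >> 8) & 0xFF
    b = i & 0xFF
    return (r, g, b)

def rgb_to_index(rgb):
    """Recover the occurrence index from the color read out of the vector file."""
    r, g, b = rgb
    return (r << 16) | (g << 8) | b
```

Because the packing is lossless for indices below 2**24, a text element's color in the vector file uniquely identifies which occurrence of the word it is.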
In the above formula (9), (w′, c, x0, y0, x1, y1, W, H) is the information extracted from the vector file 720, including the text w′ represented by the text element, the color c with which the text element is labeled in the document 720, the relative spatial position of the text element, and the total width W and height H of the document. If two text elements in the two documents 710 and 720 are found to represent the same text, that is, w = w′, it is possible to determine the corresponding colors according to the mapping function and the index i of the occurrence order of the text elements, thereby accurately determining which text element in the vector file 720 a given text element in the editable text document 710 corresponds to, and then regard the relative spatial position determined from the vector file 720 as the spatial position of that text element. When a sequence-to-sequence (seq2seq) network configuration such as the Transformer network configuration is used, a self-attention mask should also be considered during training. In some implementations, the self-attention mask may be determined in such a way that text elements in the input source sequence may pay attention to one another, elements in the target sequence may only pay attention to elements located to their left in the overall sequence composed of the source sequence and the target sequence, and elements in the source sequence do not pay attention to elements in the output target sequence. The self-attention mask may be represented as: M(i, j) = 1 if i, j ∈ src or j ≤ i, and M(i, j) = 0 otherwise (10), where src represents the input source sequence, i.e., the input text sequence 310. The target sequence includes the output sequence of the order determination model 150, i.e., a sequence indicating the reading order. In the above Equation (10), 1 indicates that the element i in the sequence is allowed to pay attention to the element j, and 0 indicates that the attention is not allowed. When the attention is not allowed, the element j will not affect the judgment of the element i.
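The mask of Equation (10) can be sketched as follows (a minimal illustration; `seq2seq_mask` is a hypothetical helper name):

```python
def seq2seq_mask(src_len, tgt_len):
    """Build the seq2seq self-attention mask of Equation (10):
    source positions attend to every source position; target positions attend
    to all source positions and to target positions at or before themselves;
    source positions never attend to target positions."""
    n = src_len + tgt_len
    mask = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i < src_len and j < src_len:   # src -> src: full attention
                mask[i][j] = 1
            elif i >= src_len and j <= i:     # tgt -> src and leftward tgt
                mask[i][j] = 1
    return mask
```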
As shown in Fig. 8, upon training, a sequence 810 is composed of [SOS]S1[EOS]S2[EOS], where S1 is the source sequence, S2 is the target sequence, [SOS] represents the start of the sequence, and [EOS] indicates the end of a sequence. According to the above Equation (10), a self-attention mask 820 may be set. During training, elements in the two sequences are randomly masked. If an element in the source sequence is masked, the model may pay attention to all elements in the source sequence; if an element in the target sequence is masked, the model can only pay attention to the elements in the source sequence and the elements in the target sequence located at or before the masked element. The above self-attention mask is often used in the training of natural language models, so it will not be described in further detail here.
Implementation of pre-training of the feature extraction model
As mentioned above, the feature extraction model 120 may be pre-trained to be capable of extracting semantic feature representations of text elements in a document. After the parameter values of the feature extraction model 120 are determined by the pre-training, the parameter values are further fine-tuned for a subsequent specific document understanding task, such as the reading order detection task. After the pre-training, when training for a specific task, the parameter values of the feature extraction model 120 may converge to a target faster, thereby improving training efficiency. In the pre-training stage, since there is no specific task, there is no definite supervision information regarding the task to guide the training. In some implementations, training is completed by setting some self-supervised tasks. This will be discussed in detail below. Fig. 9 illustrates an example architecture of pre-training of the feature extraction model 120 in accordance with some embodiments of the present disclosure.
In the pre-training stage, the pre-training may be performed with respective sample documents in the sample document set 112. The pre-training process may, for example, be performed by the pre-training device 110. Fig. 9 shows the input sequence of the feature extraction model 120, determined from the sample document 902 (which may be included in the sample document set 112). The form of the input sequence at this time is similar to that of the input sequence discussed above with reference to Fig. 3 and Fig. 4. For example, the text recognition system 310 may be used to extract a text sequence from the sample document 902, and the feature map extractor 340 may be used to extract visual information. The input of the feature extraction model 120 includes a visual embedding representation 934, an embedding representation 930 of the text elements, an embedding representation 931 of the layout information, an embedding representation 932 of sequence index information, and an embedding representation 933 of segment identification information. In some implementations, the self-supervision task for the pre-training includes a text masking-reconstruction task 950. Specifically, after the text sequence is extracted from the sample document 902 and before the text sequence is provided to the feature extraction model 120, a preprocessing module 914 may randomly mask one or more of the text elements. For example, a text element is replaced with a predetermined symbol [MASK]. In Fig. 9, it is assumed that in the text sequence T1 to T7, the text elements T2 and T4 are masked. Correspondingly, the embedding representation 930 of the text elements includes an embedding representation of [MASK], e.g., an all-zero vector. The self-supervision task requires the feature extraction model 120 to be able to extract the semantic feature representations of the masked text elements, so that the masked text elements can be reconstructed according to the semantic feature representations.
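To make the assembly of these inputs concrete, the sketch below concatenates a per-element text embedding with a layout embedding derived from the element's bounding box. Everything here is a stand-in: the helper name, the 0-to-1000 normalization grid, and the direct use of coordinates (real models typically look the normalized coordinates up in learned embedding tables) are assumptions, not details from the source:

```python
def build_input_embeddings(token_embs, boxes, page_w, page_h, scale=1000):
    """Concatenate each text element's text embedding with a simple layout
    embedding derived from its bounding box normalized to a fixed grid."""
    out = []
    for emb, (x0, y0, x1, y1) in zip(token_embs, boxes):
        # Normalize vertex coordinates so they are resolution-independent.
        layout = [x0 * scale / page_w, y0 * scale / page_h,
                  x1 * scale / page_w, y1 * scale / page_h]
        out.append(list(emb) + layout)  # concatenate along the feature axis
    return out
```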
In some implementations, the corresponding text elements may be determined using a language model (LM), based on the semantic feature representation corresponding to the masked text element output from the feature extraction model 120, for example, the semantic feature representation corresponding to the symbol [MASK] in the output semantic feature representation 940. In some implementations, the parameter values of the feature extraction model 120 may be updated based on a difference between the reconstructed text element and the ground-truth masked text element, so that the feature extraction model 120 can reduce such a difference based on the updated parameter values. In some implementations, the self-supervision task additionally or alternatively includes a text-image alignment task 952. Specifically, after the text sequence is extracted from the sample document 902, the preprocessing module 914 may randomly select one or more text elements from the text sequence. The preprocessing module 914 may mask the selected text elements in the image-format file 912 corresponding to the sample document 902, for example by overlaying black blocks, so that the corresponding text cannot be visually seen in the image-format file 912. In addition, the masking is also recorded to indicate which text elements in the text sequence are masked and which are not. For example, in the example of Fig. 9, it is assumed that the text elements T1 and T3 corresponding to text line 1 in the image-format file 912 are masked, and the other text elements are not masked. In some implementations, since the resolution of the image-format file is limited, text elements may be masked per text line. With the partial regions masked, a feature map 922 characterizing visual information is extracted from the image-format file 912.
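The preprocessing step of the text masking-reconstruction task can be sketched as follows (a minimal illustration; the function name, the mask ratio and the seed are assumptions, not from the source):

```python
import random

def mask_text_elements(tokens, mask_ratio=0.15, seed=0):
    """Randomly replace a fraction of text elements with [MASK], returning the
    masked sequence plus the original values at the masked positions, which
    serve as the reconstruction targets for the self-supervised task."""
    rng = random.Random(seed)
    masked = list(tokens)
    targets = {}
    for pos in range(len(tokens)):
        if rng.random() < mask_ratio:
            targets[pos] = tokens[pos]
            masked[pos] = "[MASK]"
    return masked, targets
```

The model is then trained to predict `targets[pos]` from the semantic feature representation it produces at each masked position.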
At this time, the visual presentation corresponding to a masked text element might not be directly extracted from the feature map 922. The corresponding visual embedding representation 934 and the other embedding representations are provided to the feature extraction model 120 for extracting the semantic feature representations of the text elements. The text-image alignment task 952 requires the feature extraction model 120 to be able to extract the semantic feature representation of a text element, so that it can be determined from the semantic feature representation whether the corresponding text element is masked in the image-format file 912. In some implementations, a cross-modal alignment model may be used to estimate the masking information based on the semantic feature representations corresponding to the respective text elements output from the feature extraction model 120, for example, the semantic feature representations corresponding to the respective text elements in the output semantic feature representation 940, to indicate whether the corresponding text elements are masked in the image-format file 912. In some implementations, the text elements masked in the text masking-reconstruction task 950 may be excluded from consideration. In some implementations, the parameter values of the feature extraction model 120 may be updated based on a difference between the masking information determined in the preprocessing stage and the estimated masking information, so that the feature extraction model 120 can reduce such a difference based on the updated parameter values. The text-image alignment task 952 may enable the feature extraction model 120 to learn, from the visual information of the document, more features that accurately characterize the semantics of the text elements. In some implementations, as shown in Fig. 9, the output of the feature extraction model 120 may further include a visual feature representation.
The visual feature representation may characterize information learned from the input embedding representations and related to the visual features of the document. In this case, the self-supervision task for the pre-training additionally or alternatively further includes a text-image matching task 954. Such a task also promotes the pre-training of the model through the information in the two modalities: image and text. According to the text-image matching task 954, it may be determined whether the image-format file matches the sample document based on the visual feature representation 941 output by the feature extraction model. As shown in Fig. 9, the visual representation output by the feature extraction model 120 may include four visual feature representation portions corresponding to the feature embedding representations V1 to V4, respectively. In some implementations, the output feature representation corresponding to the input starting symbol [CLS] may be considered as a sum of the four visual feature representation portions, and therefore may be provided to the text-image matching model for determining whether the image-format file and the sample document match. In the example of Fig. 9, if the input is the sample document 902 and the image-format file 912 converted from the sample document 902, the ground-truth match result should be that the two match each other. If the match result determined by the text-image matching model based on the extracted visual feature representation is a mismatch, the feature extraction model 120 needs to be further updated so that a correct match result can be determined. In some implementations, for the text-image matching task 954, negative samples may also be constructed, each including a sample document and a randomly-selected image-format file that do not match each other.
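The negative-sample construction for the text-image matching task can be sketched as follows (the helper name, ratio and seed are illustrative assumptions; each document is paired with its own image as a positive and, optionally, with another document's image as a negative):

```python
import random

def build_matching_pairs(documents, images, neg_ratio=1.0, seed=0):
    """Pair each document with its own image (label 1) and, with probability
    neg_ratio, with a randomly chosen image of a different document (label 0)."""
    rng = random.Random(seed)
    pairs = []
    for idx, (doc, img) in enumerate(zip(documents, images)):
        pairs.append((doc, img, 1))                 # positive: matching pair
        if rng.random() < neg_ratio and len(images) > 1:
            j = rng.randrange(len(images))
            while j == idx:                         # reject the matching image
                j = rng.randrange(len(images))
            pairs.append((doc, images[j], 0))       # negative: mismatched pair
    return pairs
```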
Then, the text-image matching model determines the corresponding match result based on the visual feature representation output by the feature extraction model 120. Through the text-image alignment task 952 and the text-image matching task 954, the feature extraction model 120 may acquire the capability of extracting features across modalities (image and text). It should be appreciated that other self-supervision tasks may also be involved in pre-training the feature extraction model 120, or only one or some of the three self-supervision tasks discussed above may be considered. The implementations of the present disclosure are not limited in this respect. After the pre-training process discussed with reference to Fig. 9, the pre-trained feature extraction model 120 may be used for various downstream tasks, and is not limited to the reading order detection task discussed above. For example, other downstream tasks may include a receipt understanding task, a document classification task, a table information extraction task, and so on. In particular, the visual feature representation and/or the semantic feature representation output by the feature extraction model 120 may be used to process the downstream tasks. Since the feature extraction model 120 has learned how to better extract features in the pre-training stage, it may quickly converge in training for a specific task and demonstrate good performance.
Example process
Fig. 10 illustrates a flow chart of a process 1000 of reading order detection in accordance with some embodiments of the present disclosure. The process 1000 may be implemented by the model application system 120. At block 1010, the model application system 120 determines a text sequence and layout information presented in the document. The text sequence includes a plurality of text elements, and the layout information indicates a spatial layout of the plurality of text elements in the document.
At block 1020, the model application system 120 generates a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and layout information. At block 1030, the model application system 120 determines a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations. In some implementations, the layout information includes relative spatial positions and sizes of the plurality of text elements in the document. In some implementations, generating the plurality of semantic feature representations comprises: for a first text element in the text sequence, determining an attention weight for the first text element with respect to a second text element in the text sequence based on at least one of the following: a relative spatial positioning of the first text element with respect to the second text element in the document, and a relative ranking position of the first text element with respect to the second text element within the text sequence, the attention weight indicating an importance degree of the second text element to the first text element; and determining a semantic feature representation of the first text element by weighting an embedding representation of the second text element with the determined attention weight. In some implementations, generating the plurality of semantic feature representations comprises: determining an image-format file corresponding to the document; determining visual information from the image-format file, the visual information indicating visual appearances of the plurality of text elements presented in the document; and generating the plurality of semantic feature representations further based on the visual information.
In some implementations, generating the plurality of semantic feature representations comprises: converting the text sequence and the layout information into a first embedding representation and a second embedding representation, respectively; concatenating the first embedding representation and the second embedding representation, to obtain a concatenated embedding representation; and applying the concatenated embedding representation to a trained feature extraction model to generate the plurality of semantic feature representations. Fig. 11 illustrates a flow chart of a process 1100 for model training in accordance with some embodiments of the present disclosure. The process 1100 may be implemented by the model training system 110. At block 1110, the model training system 110 determines a text sequence, layout information and order labeling information presented in a first sample document. The text sequence includes a first set of text elements, the layout information indicates a spatial layout of the first set of text elements in the first sample document, and the order labeling information indicates a ground-truth reading order of the first set of text elements in the first sample document. At block 1120, the model training system 110 generates, using a feature extraction model, respective semantic feature representations of the first set of text elements at least based on the text sequence and the layout information. At block 1130, the model training system 110 determines, using an order determination model, a predicted reading order of the first set of text elements in the first sample document based on the semantic feature representations. At block 1140, the model training system 110 trains the feature extraction model and order determination model based on a difference between the predicted reading order and the ground-truth reading order. In some implementations, the first sample document includes an editable text document. 
In some implementations, determining the order labeling information includes: determining format information corresponding to the editable text document, the format information at least indicating the ground-truth reading order of the first set of text elements. In some implementations, determining the layout information includes: determining a vector file corresponding to the first sample document; and determining the layout information of the first set of text elements from the vector file. In some implementations, a plurality of text elements that occur at different positions in the first sample document and represent a same text are assigned different indices, and the plurality of text elements are labeled with different colors in the vector file, the color with which each text element is labeled being determined based on the index assigned to that text element. In some implementations, determining the layout information of the first set of text elements from the vector file includes: assigning, based on the indices and the colors assigned to the plurality of text elements, layout information determined from the vector file to the plurality of text elements extracted from the first sample document. In some implementations, generating the semantic feature representations includes: determining visual information from a first image-format file corresponding to the first sample document, the visual information indicating visual appearances of the first set of text elements presented in the first sample document; and generating, using the feature extraction model, the semantic feature representations further based on the visual information.
In some implementations, the method further comprises obtaining the pre-trained feature extraction model by: determining a second image-format file corresponding to a second sample document, the second sample document comprising a second set of text elements; generating, by masking at least one of the second set of text elements in the second image-format file, first masking information to indicate that the at least one text element is masked and other text elements of the second set of text elements are not masked; determining, using the feature extraction model, respective semantic feature representations of the second set of text elements; determining second masking information based on the respective semantic feature representations of the second set of text elements, the second masking information indicating whether respective text elements of the second set of text elements are masked or not; and pre-training the feature extraction model based on a difference between the first masking information and the second masking information. In some implementations, the feature extraction model is further configured to generate a first visual feature representation of the first image-format file based on the text sequence, the layout information, and the visual information. 
In some implementations, the method further includes obtaining the feature extraction model that is pre-trained by: determining a third sample document, a third image-format file, and match labeling information, the match labeling information indicating whether the third image-format file matches the third sample document or not; generating, using the feature extraction model, a second visual feature representation of the third image-format file based on the third sample document and the third image-format file; determining, based on the second visual feature representation, a match result indicating whether the third image-format file matches the third sample document; and pre-training the feature extraction model based on a difference between the match result and the match labeling information.
Example device
Fig. 12 illustrates a block diagram of a computing device 1200 that can implement some embodiments of the present disclosure. It would be appreciated that the computing device 1200 as shown in Fig. 12 is merely provided as an example, without suggesting any limitation to the functionalities and scope of embodiments of the present disclosure. The computing device 1200 may be used to implement the reading order detection and/or model training processes according to embodiments of the subject matter described herein. As shown in Fig. 12, the computing device 1200 is in the form of a general-purpose computing device. Components of the computing device 1200 may include, but are not limited to, one or more processors or processing units 1210, a memory 1220, a storage device 1230, one or more communication units 1240, one or more input devices 1250, and one or more output devices 1260. In some implementations, the computing device 1200 may be implemented as any user terminal or server terminal with computing capability. The server terminal may be any server, large-scale computing device, and the like provided by a variety of service providers.
The user terminal may, for example, be any type of mobile terminal, fixed terminal, or portable terminal, including a mobile phone, station, unit, device, multimedia computer, multimedia tablet, Internet node, communicator, desktop computer, laptop computer, notebook computer, netbook computer, tablet computer, personal communication system (PCS) device, personal navigation device, personal digital assistant (PDA), audio/video player, digital camera/video camera, positioning device, TV receiver, radio broadcast receiver, E-book device, gaming device, or any combination thereof, including the accessories and peripherals of these devices, or any combination thereof. It is also anticipated that the computing device 1200 can support any type of interface to a user (such as "wearable" circuitry and the like). The processing unit 1210 can be a physical or virtual processor and may execute various processes based on the programs stored in the memory 1220. In a multi-processor system, a plurality of processing units execute computer-executable instructions in parallel so as to enhance the parallel processing capability of the computing device 1200. The processing unit 1210 may also be known as a central processing unit (CPU), a microprocessor, a controller, or a microcontroller. The computing device 1200 usually includes various computer storage media. The computer storage media may be any available media accessible by the computing device 1200, including but not limited to, volatile and non-volatile media, or detachable and non-detachable media. The memory 1220 can be a volatile memory (for example, a register, cache, Random Access Memory (RAM)), non-volatile memory (for example, a Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory), or any combination thereof. The memory 1220 may include a processing module 1222. This program module is configured to perform the functionalities of various implementations described herein.
The processing module 1222 may be accessed and run by the processing unit 1210 to implement the corresponding functions. The storage device 1230 may be any detachable or non-detachable medium and may include machine-readable media that can be used for storing information and/or data and that are accessible by the computing device 1200. The computing device 1200 may further include additional detachable/non-detachable, volatile/non-volatile memory media. Although not shown in Fig. 12, there may be provided a disk drive for reading from or writing into a detachable and non-volatile disk, and an optical disk drive for reading from and writing into a detachable non-volatile optical disc. In such a case, each drive may be connected to a bus (not shown) via one or more data medium interfaces. The communication unit 1240 implements communication with another computing device via the communication medium. In addition, the functions of components in the computing device 1200 may be implemented by a single computing cluster or a plurality of computing machines that can communicate with each other via communication connections. Therefore, the computing device 1200 may operate in a networked environment using a logic connection with one or more other servers, network personal computers (PCs), or other general network nodes. The input device 1250 may include one or more of a variety of input devices, such as a mouse, keyboard, tracking ball, voice-input device, and the like. The output device 1260 may include one or more of a variety of output devices, such as a display, loudspeaker, printer, and the like.
By means of the communication unit 1240, the computing device 1200 can further communicate with one or more external devices (not shown) such as storage devices and display devices, one or more devices that enable the user to interact with the computing device 1200, or any devices (such as a network card, a modem and the like) that enable the computing device 1200 to communicate with one or more other computing devices, if required. Such communication may be performed via input/output (I/O) interfaces (not shown). In some implementations, as an alternative to being integrated on a single device, some or all components of the computing device 1200 may also be arranged in the form of a cloud computing architecture. In the cloud computing architecture, the components may be provided remotely and work together to implement the functionalities described in the subject matter described herein. In some implementations, cloud computing provides computing, software, data access and storage services, which do not require end users to be aware of the physical locations or configurations of the systems or hardware provisioning these services. In various implementations, cloud computing provides the services via a wide area network (such as the Internet) using proper protocols. For example, a cloud computing provider provides applications over the wide area network, which can be accessed through a web browser or any other computing component. The software or components of the cloud computing architecture and corresponding data may be stored in a server at a remote position. The computing resources in the cloud computing environment may be aggregated or distributed at locations of remote data centers. Cloud computing infrastructures may provide the services through a shared data center, though they behave as a single access point for the users.
Therefore, the cloud computing infrastructure may be utilized to provide the components and functionalities described herein from a service provider at a remote location. Alternatively, they may be provided from a conventional server or may be installed directly or otherwise on a client device. The computing device 1200 can be used to implement reading order detection and/or model training in various embodiments of the present disclosure. The computing device 1200, e.g., the memory 1220, includes the processing module 1222 to perform reading order detection and/or model training in various embodiments of the present disclosure.
Example implementations
In a first aspect, the subject matter described herein provides a computer-implemented method. The method comprises: determining a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations. In some implementations, the layout information includes relative spatial positions and sizes of the plurality of text elements in the document.
In some implementations, generating the plurality of semantic feature representations comprises: for a first text element in the text sequence, determining an attention weight for the first text element with respect to a second text element in the text sequence based on at least one of the following: a relative spatial positioning of the first text element with respect to the second text element in the document, and a relative ranking position of the first text element with respect to the second text element in the text sequence, the attention weight indicating an importance degree of the second text element to the first text element; and determining a semantic feature representation of the first text element by weighting an embedding representation of the second text element with the determined attention weight. In some implementations, generating the plurality of semantic feature representations comprises: determining an image-format file corresponding to the document; determining visual information from the image-format file, the visual information indicating visual appearances of the plurality of text elements presented in the document; and generating the plurality of semantic feature representations further based on the visual information. In some implementations, generating the plurality of semantic feature representations comprises: converting the text sequence and the layout information into a first embedding representation and a second embedding representation, respectively; concatenating the first embedding representation and the second embedding representation, to obtain a concatenated embedding representation; and applying the concatenated embedding representation to a trained feature extraction model to generate the plurality of semantic feature representations. In a second aspect, the subject matter described herein provides an electronic device. 
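Purely as an illustration of the attention weighting described above, the following sketch biases standard dot-product self-attention by both the relative spatial positioning of two text elements (distance between their bounding boxes) and their relative ranking positions in the text sequence. The bias form and the coefficients `w_dist` and `w_rank` are hypothetical; the disclosure only requires that one or both relations influence the attention weight.

```python
import numpy as np

def spatial_attention(embeddings, boxes, w_dist=0.1, w_rank=0.05):
    """Self-attention over text elements, biased by relative spatial
    distance and relative ranking position (hypothetical weighting;
    see the corresponding implementations in the text)."""
    n, d = embeddings.shape
    scores = embeddings @ embeddings.T / np.sqrt(d)  # content similarity
    for i in range(n):
        for j in range(n):
            # relative spatial positioning: distance between box origins
            dist = np.linalg.norm(boxes[i][:2] - boxes[j][:2])
            # relative ranking position in the text sequence
            rank = abs(i - j)
            scores[i, j] -= w_dist * dist + w_rank * rank
    # row-wise softmax yields the attention weights (importance degrees)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    # each semantic representation is the attention-weighted sum of embeddings
    return weights @ embeddings
```

Each output row is the semantic feature representation of one text element, obtained by weighting the embedding representations of the other elements with the determined attention weights.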
The electronic device comprises: a processor; and a memory coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the device to perform acts comprising: determining a text sequence and layout information presented in a document, the text sequence comprising a plurality of text elements, the layout information indicating a spatial layout of the plurality of text elements in the document; generating a plurality of semantic feature representations corresponding to the plurality of text elements based at least on the text sequence and the layout information; and determining a reading order of the plurality of text elements in the document based on the plurality of semantic feature representations. In some implementations, the layout information includes relative spatial positions and sizes of the plurality of text elements within the document. In some implementations, generating the plurality of semantic feature representations comprises: for a first text element in the text sequence, determining an attention weight for the first text element with respect to a second text element in the text sequence based on at least one of the following: a relative spatial positioning of the first text element with respect to the second text element in the document, and a relative ranking position of the first text element with respect to the second text element in the text sequence, the attention weight indicating an importance degree of the second text element to the first text element; and determining a semantic feature representation of the first text element by weighting an embedding representation of the second text element with the determined attention weight. 
In some implementations, generating the plurality of semantic feature representations comprises: determining an image-format file corresponding to the document; determining visual information from the image-format file, the visual information indicating visual appearances of the plurality of text elements presented in the document; and generating the plurality of semantic feature representations further based on the visual information. In some implementations, generating the plurality of semantic feature representations comprises: converting the text sequence and the layout information into a first embedding representation and a second embedding representation, respectively; concatenating the first embedding representation and the second embedding representation, to obtain a concatenated embedding representation; and applying the concatenated embedding representation to a trained feature extraction model to generate the plurality of semantic feature representations. In a third aspect, the subject matter described herein provides a computer program product being tangibly stored in a computer storage medium and comprising computer-executable instructions, the computer-executable instructions, when executed by a device, causing the device to perform one or more implementations of the method according to the first aspect. In a fourth aspect, the subject matter described herein provides a computer readable medium having computer-executable instructions stored thereon, the computer-executable instructions, when executed by a device, causing the device to perform one or more implementations of the method according to the first aspect. In a fifth aspect, the subject matter described herein provides a computer-implemented method. 
The method comprises: determining a text sequence, layout information, and order labeling information presented in a first sample document, the text sequence comprising a first set of text elements, the layout information indicating a spatial layout of the first set of text elements in the first sample document, the order labeling information indicating a ground-truth reading order of the first set of text elements within the first sample document; generating, using a feature extraction model, respective semantic feature representations of the first set of text elements based at least on the text sequence and the layout information; determining, using an order determination model, a predicted reading order of the first set of text elements within the first sample document based on the semantic feature representations; and training the feature extraction model and the order determination model based on a difference between the predicted reading order and the ground-truth reading order. In some implementations, the first sample document comprises an editable text document. In some implementations, determining the order labeling information comprises: determining format information corresponding to the editable text document, the format information at least indicating the ground-truth reading order of the first set of text elements. In some implementations, determining the layout information comprises: determining a vector file corresponding to the first sample document; and determining the layout information of the first set of text elements from the vector file. In some implementations, a plurality of text elements that occur at different positions in the first sample document and represent a same text are assigned with different indices, and the plurality of text elements are labeled with different colors in the vector file, the color with which each text element is labeled being determined based on the index assigned to the text element.
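One hypothetical way to realize the "difference between the predicted reading order and the ground-truth reading order" used to train the models is a listwise ranking loss over per-element scores produced by the order determination model; the Plackett-Luce formulation below is an illustrative sketch, not the claimed implementation, and the scoring model is assumed rather than disclosed.

```python
import numpy as np

def plackett_luce_loss(scores, true_order):
    """Negative log-likelihood of the ground-truth reading order under a
    Plackett-Luce ranking model over per-element scores. This is one
    hypothetical differentiable 'difference' between predicted and
    ground-truth order; the disclosure does not fix a particular loss."""
    loss = 0.0
    remaining = list(true_order)
    while remaining:
        s = scores[remaining]
        log_z = s.max() + np.log(np.exp(s - s.max()).sum())  # stable log-sum-exp
        loss += log_z - scores[remaining[0]]
        remaining.pop(0)
    return loss

def predicted_order(scores):
    """Greedy predicted reading order: elements sorted by descending score."""
    return list(np.argsort(-scores))
```

When the scores already rank the elements in the ground-truth order, the loss is smaller than for any conflicting ranking, so gradient descent on it pushes the feature extraction and order determination models toward the labeled reading order.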
In some implementations, determining the layout information of the first set of text elements from the vector file comprises: assigning, based on the indices and the colors assigned to the plurality of text elements, layout information determined from the vector file to the plurality of text elements extracted from the first sample document. In some implementations, generating the semantic feature representations comprises: determining visual information from a first image-format file corresponding to the first sample document, the visual information indicating visual appearances of the first set of text elements presented in the first sample document; and generating, using the feature extraction model, the semantic feature representations further based on the visual information. In some implementations, the method further comprises obtaining the pre-trained feature extraction model by: determining a second image-format file corresponding to a second sample document, the second sample document comprising a second set of text elements; generating, by masking at least one of the second set of text elements in the second image-format file, first masking information to indicate that the at least one text element is masked and other text elements of the second set of text elements are not masked; determining, using the feature extraction model, respective semantic feature representations of the second set of text elements; determining second masking information based on the respective semantic feature representations of the second set of text elements, the second masking information indicating whether respective text elements of the second set of text elements are masked or not; and pre-training the feature extraction model based on a difference between the first masking information and the second masking information.
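The masking-based pre-training objective above can be sketched as a per-element binary classification: from each element's semantic feature representation, predict the second masking information and compare it against the first masking information. The linear head `w` below is a hypothetical stand-in for whatever classifier sits on top of the feature extraction model.

```python
import numpy as np

def masked_pretrain_loss(features, mask_labels, w):
    """Binary cross-entropy between the first masking information
    (`mask_labels`, 1 if the element was masked in the image-format file)
    and the second masking information predicted from each element's
    semantic feature representation. `w` is a hypothetical linear head."""
    logits = features @ w
    probs = 1.0 / (1.0 + np.exp(-logits))  # sigmoid
    eps = 1e-9
    return -np.mean(mask_labels * np.log(probs + eps)
                    + (1 - mask_labels) * np.log(1 - probs + eps))
```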
In some implementations, the feature extraction model is further configured to generate a first visual feature representation of the first image-format file based on the text sequence, the layout information, and the visual information. In some implementations, the method further comprises obtaining the pre-trained feature extraction model by: determining a third sample document, a third image-format file, and match labeling information, the match labeling information indicating whether the third image-format file matches with the third sample document; generating, using the feature extraction model, a second visual feature representation of the third image-format file based on the third sample document and the third image-format file; determining, based on the second visual feature representation, a match result indicating whether the third image-format file matches with the third sample document; and pre-training the feature extraction model based on a difference between the match result and the match labeling information. In a sixth aspect, the subject matter described herein provides an electronic device.
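The text-image matching objective above reduces, in the simplest reading, to a single binary prediction from the visual feature representation, compared against the match labeling information. The linear head `w`, `b` is hypothetical; any classifier over the visual feature would fit the disclosure.

```python
import numpy as np

def match_pretrain_loss(visual_feature, match_label, w, b=0.0):
    """Binary cross-entropy between the predicted match result (whether the
    image-format file matches the sample document, inferred from the file's
    visual feature representation) and the match labeling information.
    The linear head `w`, `b` is a hypothetical stand-in for the classifier."""
    logit = float(visual_feature @ w + b)
    prob = 1.0 / (1.0 + np.exp(-logit))
    eps = 1e-9
    return -(match_label * np.log(prob + eps)
             + (1 - match_label) * np.log(1 - prob + eps))
```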
The electronic device comprises: a processor; and a memory coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the device to perform acts comprising: determining a text sequence, layout information and order labeling information presented in a first sample document, the text sequence comprising a first set of text elements, the layout information indicating a spatial layout of the first set of text elements in the first sample document, the order labeling information indicating a ground-truth reading order of the first set of text elements in the first sample document; generating, using a feature extraction model, respective semantic feature representations of the first set of text elements based at least on the text sequence and the layout information; determining, using an order determination model, a predicted reading order of the first set of text elements in the first sample document based on the semantic feature representations; and training the feature extraction model and order determination model based on a difference between the predicted reading order and the ground- truth reading order. In some implementations, the first sample document comprises an editable text document. In some implementations, determining the order labeling information comprises: determining format information corresponding to the editable text document, the format information at least indicating the ground-truth reading order of the first set of text elements. In some implementations, determining the layout information comprises: determining a vector file corresponding to the first sample document; and determining the layout information of the first set of text elements from the vector file. 
In some implementations, a plurality of text elements that occur at different positions in the first sample document and represent a same text are assigned with different indices, and the plurality of text elements are labeled with different colors in the vector file, the color with which each text element is labeled being determined based on the index assigned to the text element. In some implementations, determining the layout information of the first set of text elements from the vector file comprises: assigning, based on the indices and the colors assigned to the plurality of text elements, layout information determined from the vector file to the plurality of text elements extracted from the first sample document. In some implementations, generating the semantic feature representations comprises: determining visual information from a first image-format file corresponding to the first sample document, the visual information indicating visual appearances of the first set of text elements presented in the first sample document; and generating, using the feature extraction model, the semantic feature representations further based on the visual information.
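The index-to-color labeling above can be illustrated with a minimal sketch: encode each element's index as a unique RGB color when writing the vector file, then decode the color back to the index when reading layout boxes out of the rendered file, so that elements with identical text at different positions stay distinguishable. The 24-bit encoding is one simple choice; the disclosure only requires that the color be determined by the index.

```python
def index_to_color(index):
    """Encode a text element's index as a unique 24-bit RGB color."""
    return ((index >> 16) & 0xFF, (index >> 8) & 0xFF, index & 0xFF)

def color_to_index(color):
    """Recover the element index from its RGB label."""
    r, g, b = color
    return (r << 16) | (g << 8) | b

def assign_layout(elements, colored_boxes):
    """Attach the layout box recovered from the vector file to each extracted
    text element by matching the decoded index.
    `elements` is a list of (index, text) pairs from the sample document;
    `colored_boxes` is a list of (color, box) pairs from the vector file."""
    by_index = {color_to_index(c): box for c, box in colored_boxes}
    return [(text, by_index[idx]) for idx, text in elements]
```

Two occurrences of the same text (for example, "Total" in two table cells) receive different indices, hence different colors, and therefore different layout boxes.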
In some implementations, the method further comprises obtaining the pre-trained feature extraction model by: determining a second image-format file corresponding to a second sample document, the second sample document comprising a second set of text elements; generating, by masking at least one of the second set of text elements in the second image-format file, first masking information to indicate that the at least one text element is masked and other text elements of the second set of text elements are not masked; determining, using the feature extraction model, respective semantic feature representations of the second set of text elements; determining second masking information based on the respective semantic feature representations of the second set of text elements, the second masking information indicating whether respective text elements of the second set of text elements are masked; and pre-training the feature extraction model based on a difference between the first masking information and the second masking information. In some implementations, the feature extraction model is further configured to generate a first visual feature representation of the first image-format file based on the text sequence, the layout information, and the visual information.
In some implementations, the method further comprises obtaining the pre-trained feature extraction model by: determining a third sample document, a third image-format file, and match labeling information, the match labeling information indicating whether the third image-format file matches with the third sample document; generating, using the feature extraction model, a second visual feature representation of the third image-format file based on the third sample document and the third image-format file; determining, based on the second visual feature representation, a match result indicating whether the third image-format file matches with the third sample document or not; and pre-training the feature extraction model based on a difference between the match result and the match labeling information. In a seventh aspect, the subject matter described herein provides a computer program product being tangibly stored in a computer storage medium and comprising computer-executable instructions, the computer-executable instructions, when executed by a device, causing the device to perform one or more implementations of the method according to the fifth aspect. In an eighth aspect, the subject matter described herein provides a computer readable medium having computer-executable instructions stored thereon, the computer-executable instructions, when executed by a device, causing the device to perform one or more implementations of the method according to the fifth aspect. The functionalities described herein can be performed, at least in part, by one or more hardware logic components. As an example, and without limitation, illustrative types of hardware logic components that can be used include field-programmable gate arrays (FPGAs), Application- specific Integrated Circuits (ASICs), application-specific standard products (ASSPs), system-on- a-chip systems (SOCs), complex programmable logic devices (CPLDs), and the like. 
Program code for carrying out the methods of the subject matter described herein may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely or partly on a machine, executed as a stand-alone software package partly on the machine, partly on a remote machine, or entirely on the remote machine or server. In the context of this disclosure, a machine-readable medium may be any tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include but is not limited to an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium would include an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random-access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations are performed in the particular order shown or in sequential order, or that all illustrated operations are performed to achieve the desired results. 
In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are contained in the above discussions, these should not be construed as limitations on the scope of the subject matter described herein, but rather as descriptions of features that may be specific to particular implementations. Certain features that are described in the context of separate implementations may also be implemented in combination in a single implementation. Conversely, various features that are described in the context of a single implementation may also be implemented in multiple implementations separately or in any suitable sub-combination. Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter specified in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.