


Title:
SYSTEM AND METHOD FOR AUTOMATIC DIACRITIZING VIETNAMESE TEXT
Document Type and Number:
WIPO Patent Application WO/2014/138756
Kind Code:
A1
Abstract:
Systems and methods for automatically diacritizing Vietnamese text entered using a physical or virtual computer keyboard are provided. In accordance with some embodiments, a method for automatically diacritizing Vietnamese text is provided, the method comprising: detecting a phrase ending character and automatically diacritizing a previously entered phrase while the user may continue entering other phrases; and detecting a special character, or, on a virtual computer keyboard, a touch event on a previously entered word, to allow manual diacritizing of Vietnamese text.

Inventors:
DANG THI MAI HUONG (VN)
NGUYEN VIET HAI (VN)
Application Number:
PCT/VN2013/000005
Publication Date:
September 12, 2014
Filing Date:
April 12, 2013
Assignee:
DANG THI MAI HUONG (VN)
NGUYEN VIET HAI (VN)
International Classes:
G06F3/01; G06F17/27
Foreign References:
US20130006613A12013-01-03
US20080077396A12008-03-27
US20130050098A12013-02-28
US20110087484A12011-04-14
Other References:
TUAN ANH LUU ET AL: "A Pointwise Approach for Vietnamese Diacritics Restoration", ASIAN LANGUAGE PROCESSING (IALP), 2012 INTERNATIONAL CONFERENCE ON, IEEE, 13 November 2012 (2012-11-13), pages 189 - 192, XP032339752, ISBN: 978-1-4673-6113-2, DOI: 10.1109/IALP.2012.18
MINH TRUNG NGUYEN ET AL: "Vietnamese Diacritics Restoration as Sequential Tagging", COMPUTING AND COMMUNICATION TECHNOLOGIES, RESEARCH, INNOVATION, AND VISION FOR THE FUTURE (RIVF), 2012 IEEE RIVF INTERNATIONAL CONFERENCE ON, IEEE, 27 February 2012 (2012-02-27), pages 1 - 6, XP032138192, ISBN: 978-1-4673-0307-1, DOI: 10.1109/RIVF.2012.6169816
RUHI SARIKAYA ET AL: "Maximum Entropy Modeling for Diacritization of Arabic Text", INTERSPEECH 2006 -INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING) INTERSPEECH 2006 - ICSLP : NINTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING ; PITTSBURGH, PENNSYLVANIA, USA, SEPTEMBER 17 - 21, 2006 / ISCA, BONN : ISCA, 2006, DE, 17 September 2006 (2006-09-17), pages 145 - 148, XP008163575
MOUSTAFA ELSHAFEI ET AL: "Statistical Methods for Automatic Diacritization of Arabic Text", 18 November 2000 (2000-11-18), pages 1 - 8, XP002624239, Retrieved from the Internet [retrieved on 20110221]
Claims:
What is claimed is:

1. A method of automatic diacritizing Vietnamese text using an electronic computing device, the method comprising:

detecting a phrase ending character and automatically diacritizing a previously entered phrase while user may continue entering other phrases;

detecting a special character to allow manual diacritizing Vietnamese text.

2. The method of claim 1, wherein the auto diacritization of a phrase is performed using a vocabulary of Vietnamese words and phrases, each word or each phrase associated with a score.

3. The method of claim 2, wherein the auto diacritization of a phrase is performed by a solver searching a vocabulary for the match of each syllable in the entered phrase, generating all possible diacritized phrases and choosing the diacritized phrase with the highest score.

4. The method of claim 2, wherein the vocabulary is built, extended and maintained using a large corpus of Vietnamese text.

5. The method of claim 2, wherein the vocabulary may be shared among a plurality of users.

6. The method of claim 1 wherein the manual editing of Vietnamese text with diacritics is performed using either TELEX or VNI or VIQR methods or any combination thereof.

7. A method of automatic diacritizing Vietnamese text using an electronic device with touch screen keyboard, the method comprising:

detecting a phrase ending character to automatically diacritize a previously entered phrase while user may continue entering other phrases;

detecting a touch event on a previously entered word to show a list of word matching options, replacing previously entered word by a user-selected option.

8. The method of claim 7, wherein the auto diacritization of text is performed using a vocabulary of Vietnamese words and phrases, each word or each phrase associated with a score.

9. The method of claim 7, wherein the auto diacritization of text is performed by a solver searching a vocabulary for match of each syllable in the entered phrase, generating all possible diacritized phrases and choosing the diacritized phrase with the highest score.

10. The method of claim 7, wherein the list of word matching options is sorted based on the score associated with each word in the vocabulary.

11. The method of claim 7, wherein the sorted list of word matching options is organized such that only the first three highest scored options are shown to the user; all the remaining options are only shown upon explicit request from the user.

12. A system for automatic diacritizing Vietnamese text, the system comprising:

at least one processor;

at least one computer readable memory containing computer executable instructions, the instructions to perform the method of:

detecting a phrase ending character entered by user and diacritizing a previously entered phrase by employing a solver to (i) search a vocabulary of words and phrases, each word or each phrase associated with a score for a match of each syllable in the entered text, (ii) generate all possible diacritized phrases and (iii) choose the diacritized phrase with the highest score,

detecting a special character entered by user and adding, removing or changing diacritics of word previously entered or diacritized by employing either TELEX or VNI or VIQR typing methods,

building, updating and maintaining a vocabulary of words and phrases, each word or each phrase associated with a score, using a large corpus of Vietnamese text.

13. The system of claim 12, further comprising a large corpus of text for calculating scores associated with any input word or phrase.

14. The system of claim 12 wherein the said processor or a plurality of processors are hosted on a standalone system.

15. The system of claim 12 wherein the solver, the vocabulary and the text corpus are hosted on a server or a plurality of servers.

16. A system for automatic diacritizing Vietnamese text on an electronic device with touch screen keyboard, the system comprising:

at least one processor;

at least one computer readable memory containing computer executable instructions, the instructions to perform the method of:

detecting a phrase ending character entered by user and diacritizing a previously entered phrase by employing a solver to (i) search a vocabulary of words and phrases, each word or each phrase associated with a score for a match of each syllable in the entered text, (ii) generate all possible diacritized phrases and (iii) choose the diacritized phrase with the highest score,

detecting a touch event on a previously entered word to show a list of word matching options, replacing previously entered word by a user-selected option, building a vocabulary of words and phrases, each word or each phrase associated with a score.

17. The system of claim 16 wherein the solver, the vocabulary and the text corpus may be hosted on a server or a plurality of servers.

Description:
SYSTEM AND METHOD FOR AUTOMATIC DIACRITIZING VIETNAMESE TEXT

TECHNICAL FIELD OF THE INVENTION

The present invention relates to diacritization of Vietnamese text and more particularly to an automatic diacritization system and method to support typing and editing Vietnamese on a physical or virtual keyboard.

BACKGROUND

The Vietnamese alphabet is based on the Latin alphabet. However, in Vietnamese, each vowel may have from 6 to 18 variants using up to two levels of diacritical marks. For example, the letter a has 18 diacritical variants: a à á ả ã ạ ă ằ ắ ẳ ẵ ặ â ầ ấ ẩ ẫ ậ

Altogether, there are 72 different vowels or vowel variants and 17 different consonants in the Vietnamese alphabet. Including upper case letters, the total number of letters in the Vietnamese alphabet is 178. Among them, 134 letters with diacritical marks are not available on a standard keyboard layout such as QWERTY or AZERTY.

Another important characteristic of the Vietnamese language is that a word may consist of one or more syllables separated from each other, which means that blank characters or spaces between syllables do not represent word boundaries as is customary in English and many other languages. The multiple, separated syllable nature of words in Vietnamese means that Vietnamese text written without diacritics is even more ambiguous and more prone to being misunderstood and misinterpreted.

Popular Vietnamese typing methods such as Telex, VNI or VIQR, and the software systems implementing those typing methods, make use of special letters not in the Vietnamese alphabet such as [w,f,j,z], or some special characters or numbers [1234567890], or a combination of those characters, to manually enter diacritics using a standard keyboard layout.

However, manually typing Vietnamese text with diacritical marks on a standard computer keyboard is time consuming, especially for people not skilled with Vietnamese typing methods such as Telex, VNI and VIQR. Many people opt for typing Vietnamese text without diacritics, which makes reading and understanding such text difficult for other readers, sometimes even for Vietnamese native speakers. On mobile devices with a small virtual or physical keyboard, typing Vietnamese with diacritical marks is often cumbersome.

Recently, a text editing program that suggests diacritical variants for each non-diacritical syllable entered by the user has been implemented on personal computers. Another text editing program, implemented on personal computers, can automatically diacritize non-diacritical syllables entered by the user if they match a word from a dictionary created when the text editing program is installed. Various web-based typing software applications that allow adding diacritics to non-diacritical text have also been implemented.

Even though there have been a few studies and software programs tackling the diacritization issue in Vietnamese, they still provide a limited solution to the problem in terms of user experience, accuracy and scale.

SUMMARY OF THE INVENTION

It is therefore a primary objective of the present invention to provide a method and a system which (i) automatically diacritizes non-diacritical text entered by user, without any manual intervention from user (e.g. to choose between different word variants); and (ii) allows user to type and edit in the same text area, without the need to switch between edit and type mode.

This object is achieved by designing a user interface component that keeps track of the movement of the user's typing cursor to predict user intention based on the current and historical context: when the user is in typing mode or has just finished typing a word, a phrase or a sentence; and when the user is in editing mode or is about to correct a syllable. In addition, the user vocabulary, used by the language model for automatic diacritization of text, can optionally be shared among users. As such, any improvement to the automatic diacritization will benefit all users who share the vocabulary.

Due to the ambiguous nature of non-diacritical Vietnamese text, it is not always easy, even for native speakers, to choose the correct diacritized words from a few valid options. Therefore the automatically diacritized text may still be incorrect. According to the invention, the system allows users to manually edit or correct an incorrectly diacritized syllable using a popular Vietnamese typing method such as Telex, VNI or VIQR. Optionally, especially on a mobile device, the system allows the user to tap on the incorrectly diacritized syllable and then choose the correct syllable with diacritics from a pop-up list of word correction options.

A system and method for diacritization of text according to this invention includes: detecting phrase ending characters entered by the user and diacritizing the most recently entered phrase by employing an optimization solver to search for the diacritized phrase with the highest score; detecting special characters entered by the user and adding, removing or changing diacritics of a word previously entered or diacritized by employing either the TELEX, VNI or VIQR typing method; and building, updating and maintaining a vocabulary of phrases with scores.

As an alternative, a system and method for diacritization of text on an electronic device with a touch screen keyboard includes: detecting phrase ending characters entered by the user and diacritizing a previously entered phrase by employing a solver to search for the diacritized phrase with the highest score; detecting a touch event on a previously entered word to show a list of word correction options; and replacing the previously entered word by a user-selected correction option.

These and other objects, features and advantages will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing and other objects, features and advantages will become apparent from the following detailed description of preferred embodiments with reference to the following figures wherein:

FIG. 1 is a diagram representation of the system for automatic diacritization of text, according to the present invention.

FIG. 2 is a diagram representation of the system for automatic diacritization of text, according to the present invention, for touch device.

FIG. 3 is a block diagram showing the logic to determine when automatic diacritization of text is performed and when manual diacritization by user is initiated.

FIG. 4 is a block diagram showing the logic for manually editing or correcting text with diacritics on a touch device according to the present invention.

FIG. 5 is a block diagram showing the steps to automatically diacritize text, according to the present invention.

DETAILED DESCRIPTION OF THE INVENTION

The accompanying figures and the description that follows set forth the disclosed and claimed concept in its preferred embodiments. It is, however, contemplated that persons generally familiar with statistical natural language processing and computer programming will be able to apply the novel characteristics of the methods illustrated and described herein in other contexts by modification of certain details. Accordingly, the figures and description are not to be taken as restrictive on the scope of the disclosed and claimed concept, but are to be understood as broad and general teachings.

Referring to FIG. 1, a general view of a preferred embodiment of the system for diacritization of text is provided. The user interface event handler 100 is responsible for handling and processing all text editing and typing interactions on the keyboard and showing their effect on the display. The manual diacritizer 101 is invoked when the user presses a special key on the keyboard to manually put diacritics in the text. Depending on which special key is pressed, the corresponding typing method among the three implemented typing methods, TELEX, VNI and VIQR, is chosen. For example, [weroasdfjxz] are the special keys for the TELEX method, [0123456789] are for VNI and [~^(+\'/?] are for VIQR. When the user has completed typing a phrase or a sentence, as evidenced by a phrase ending character that is among, but not limited to, the following characters [.,:;?!'"()], the auto diacritizer 102 may be invoked to decide if the recently entered phrase needs to be diacritized. After analyzing the context, the auto diacritizer 102 may invoke the solver 104 to search in the vocabulary 106 for words or phrases matching the user input, generate candidate phrases and select the solution with the highest score. The logic for auto diacritization of text in the solver 104 will be described in detail later with reference to FIG. 5. The solver 104 can be configured by using the data manager 103. Solver configuration options may include the solver type, for example either exact or approximate maximum matching, as evidenced by the highest score, or the type of language model used, for example either a 3-gram or a 5-gram language model. The data manager 103 is also responsible for maintaining the text corpus 108. Using the data manager 103, functionalities of the score calculator 105 can be invoked to train the language model that is implemented by the vocabulary 106. The score calculator 105 is responsible for updating the language model and calculating, for words and phrases, the corresponding scores that are based on statistical characteristics, such as word frequency counts and n-gram probabilities, derived from the text corpus 107.
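As an illustration of the role of the score calculator 105, the following minimal Python sketch derives a scored vocabulary from a small text corpus using relative word frequency, one of the statistical characteristics mentioned above. The function names `strip_diacritics` and `build_vocabulary`, the dictionary layout and the sample corpus are assumptions made for illustration; they are not taken from the disclosure, which also contemplates n-gram probabilities and multi-syllable phrases that this sketch does not cover.

```python
import unicodedata
from collections import Counter, defaultdict


def strip_diacritics(text):
    """Return the undiacritized (ASCII) form of a Vietnamese string."""
    # The letter d with stroke has no Unicode decomposition, so map it by hand.
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    # Drop all combining marks (tone marks, circumflex, breve, horn).
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def build_vocabulary(corpus):
    """Map each undiacritized syllable to its diacritized variants and scores.

    The score here is the relative frequency of a variant among all variants
    sharing the same undiacritized form (an assumed scoring scheme).
    """
    counts = defaultdict(Counter)
    for line in corpus:
        line = unicodedata.normalize("NFC", line.lower())
        for syllable in line.split():
            counts[strip_diacritics(syllable)][syllable] += 1
    vocabulary = {}
    for plain, variants in counts.items():
        total = sum(variants.values())
        vocabulary[plain] = {v: c / total for v, c in variants.items()}
    return vocabulary


# Hypothetical three-line corpus, for illustration only.
corpus = ["tôi đi học", "tôi đi làm", "tối nay tôi đi chơi"]
vocab = build_vocabulary(corpus)
print(vocab["toi"])
```

In this toy corpus the undiacritized key "toi" collects two variants, and the more frequent one receives the higher score. A real corpus would need punctuation handling and far more data.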

FIG. 2 shows a diagram representing an alternative embodiment of the system for auto diacritization of text as implemented on touch devices, for example a mobile phone. A user interacts with the device via a keyboard and the result of that interaction is shown on a display. In this embodiment, a user may express an intention to edit an incorrectly diacritized portion of text on the touch screen by tapping or touching that portion of the text.

The keyboard and display event handler 200 is responsible for handling all user interactions with the touch device. The touch editor 201 and the auto diacritizer 202 are responsible for processing all text editing and typing activities handled by the keyboard and display event handler 200. The logic for manual editing of diacritics in the touch editor 201 will be described in detail later with reference to FIG. 4.

The auto diacritizer 202 is generally invoked when user enters a phrase ending character. The auto diacritizer 202 then analyzes the context to decide if the recently entered phrase needs diacritization. After that the auto diacritizer 202 may invoke the solver 204 to search in the vocabulary 205 for words or phrases matching user input, generate candidate phrases and select the phrase with the highest score as the solution.

The solver 204 can be configured by user of the touch device using the settings 203. User can also use the settings 203 to manually update the vocabulary 205.

FIG. 3 provides a block diagram showing the logic to determine when automatic diacritization of a text phrase and when manual diacritization should be performed. User interactions with the device are first examined by the user interface event handler 100. In step 301 the user input is tested to determine if it is a key press event. If it is not, control is returned to the user interface event handler 100. If the user interaction causes a key press event, then in the next step 302 the input character is tested to determine if it is a phrase ending character. If the input character is a phrase ending character, then in step 303 the recently entered text is analyzed to determine if diacritics are needed for any portion of the phrase. If the answer is positive, the backend process for auto diacritization of the phrase is initiated in the next step 305 and then control is returned to the user interface event handler 100. If in the test 303 it is determined that no diacritization is needed, control is returned to the user interface event handler 100. If in step 302 the input character is not a phrase ending character, then it is tested again in step 304 to determine if it is one of the special characters used by one of the manual text typing methods listed, TELEX, VNI or VIQR, to add, change or remove diacritics. Further context analysis may also be done to determine that the adjacent text is not a foreign language word and potentially needs diacritics. If the test 304 is positive, the manual diacritizer is invoked in the next step 306. If in step 304 it is determined that the input character is not related to any of the manual typing methods, or that the adjacent text is a foreign word, then control is returned to the user interface event handler 100.

Note that the automatic diacritization of a phrase initiated in step 305 may be performed while the user is typing. Only when the diacritized text is available in step 307 is the corresponding undiacritized text in the text area of the display replaced, in step 308, by the diacritized text returned from the solver 104 in FIG. 1.
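The decision logic of FIG. 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the function names `needs_diacritics` and `handle_key`, the return labels, and the crude rule that any purely ASCII alphabetic syllable may need diacritics are all hypothetical. The VIQR key set is left out of the sketch, and the foreign-word check is a caller-supplied placeholder. Step 301 (testing for a key press event) is assumed to have already succeeded.

```python
PHRASE_ENDING = set(".,:;?!'\"()")   # phrase ending characters listed in the description
TELEX_KEYS = set("weroasdfjxz")      # special keys for the TELEX method
VNI_KEYS = set("0123456789")         # special keys for the VNI method


def needs_diacritics(phrase):
    """Step 303 (assumed heuristic): some syllable is plain ASCII letters."""
    return any(w.isascii() and w.isalpha() for w in phrase.split())


def handle_key(char, recent_phrase, is_foreign_word=lambda word: False):
    """Return 'auto', 'manual' or 'none' for one key press.

    'auto'   -> start automatic diacritization of the phrase (step 305)
    'manual' -> invoke the manual diacritizer (step 306)
    'none'   -> return control to the event handler
    """
    if char in PHRASE_ENDING:                        # step 302
        return "auto" if needs_diacritics(recent_phrase) else "none"
    if char in TELEX_KEYS or char in VNI_KEYS:       # step 304
        words = recent_phrase.split()
        last = words[-1] if words else ""
        # Context analysis: there must be an adjacent, non-foreign word.
        if last and not is_foreign_word(last):
            return "manual"
    return "none"


print(handle_key(".", "toi di hoc"))   # phrase ending character after plain text
print(handle_key("s", "toi"))          # TELEX special key after a syllable
print(handle_key("k", "toi"))          # ordinary letter
```

Because many TELEX special keys are also ordinary letters, a practical handler would need a stronger context test than the placeholder used here.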

In the preferred embodiment of the system as shown in FIG. 1, the previously mentioned functionality is implemented using Asynchronous JavaScript and XML (AJAX), a modern web technology for building fluid user interfaces in web applications. Instead of using XML (Extensible Markup Language) as the data exchange format, JavaScript Object Notation (JSON) can also be used for transmitting the diacritized text result between the solver, shown as block 104 on the server side, and the auto diacritizer 102 responsible for delivering the diacritized text to the user interface event handler 100.
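To make the JSON exchange concrete, the following Python sketch shows one possible shape for the messages passed between the auto diacritizer 102 and the solver 104, and how the returned text could replace the undiacritized phrase in the text area (steps 307 and 308). The field names `id`, `phrase`, `input` and `output`, and the helper function names, are hypothetical; the disclosure does not specify a message format.

```python
import json


def build_request(request_id, phrase):
    """Client side: wrap an undiacritized phrase for the solver."""
    return json.dumps({"id": request_id, "phrase": phrase}, ensure_ascii=False)


def build_response(request_json, diacritized):
    """Server side: echo the input so the client can locate it later."""
    request = json.loads(request_json)
    return json.dumps(
        {"id": request["id"], "input": request["phrase"], "output": diacritized},
        ensure_ascii=False,
    )


def apply_response(text_area, response_json):
    """Client side: replace the first occurrence of the original phrase."""
    response = json.loads(response_json)
    return text_area.replace(response["input"], response["output"], 1)


request = build_request(1, "toi di hoc")
response = build_response(request, "tôi đi học")
# The user kept typing while the request was in flight.
print(apply_response("toi di hoc. Hom nay", response))
```

Echoing the input phrase in the response is one simple way to let the client find the text to replace after the user has continued typing; tracking character offsets would be an alternative.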

In another alternative embodiment of the system implemented for a touch device as shown in FIG. 2, the use of AJAX and JSON may not be necessary if all the system components reside on the same device. The diacritized text result can also be transmitted synchronously between the solver 204 and the auto diacritizer 202. Users of a touch device may want to correct a word or a syllable that is missing diacritics or is incorrectly diacritized. To edit an incorrectly diacritized syllable, the user may tap or touch that syllable, and a list of possible options for correcting the diacritized syllable will be shown. The user can then choose one of the 3 top-ranked options with the highest scores as the correction. If the option the user wants is not among these 3 options, the user can tap to choose from the other available options.

FIG. 4 provides a block diagram showing the logic for manually editing or correcting text with diacritics, on a touch device, by the touch editor 201, according to the present invention. The software component handling all touch events is again depicted as the keyboard and display event handler 200, which may alternatively be understood as an infinite loop checking if a user touch event is taking place. The manual editing process is started only when the test 401 is positive, typically when the user taps or touches the text area. When a touch event is detected in the typing area of the touch screen, a syllable s may be identified as the object of the touch event, and the phrase P enclosing s is considered as the context for the event. These two pieces of data are then used in step 403 to invoke a function suggest_word to find all possible corrections for the incorrectly diacritized syllable s. In step 405 the function suggest_word calculates and returns a sorted list of correction options for s. The option list is then shown to the user in step 406 in the format of a popup menu, with the 3 most probable options at the top of the list and other options, if any exist, available via a touch on the list. Once it is detected in test 402 that the user has selected an option si from the option list, the incorrectly diacritized syllable s is replaced by si in step 404. After that, control is returned to the keyboard and display event handler 200.
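The behaviour of suggest_word and the popup menu can be sketched in Python as below. Only the name suggest_word comes from the description; its signature, the vocabulary layout, the helper `split_menu` and the sample scores are assumptions. In particular, this sketch ranks options by their stored score alone and accepts the enclosing phrase P without using it, whereas the description treats P as context for the suggestion.

```python
import unicodedata


def strip_diacritics(text):
    """Return the undiacritized (ASCII) form of a Vietnamese string."""
    text = text.replace("đ", "d").replace("Đ", "D")
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def suggest_word(syllable, phrase, vocabulary):
    """Return correction options for `syllable`, highest score first.

    `phrase` is the enclosing context P; it is unused in this simplified sketch.
    """
    options = vocabulary.get(strip_diacritics(syllable.lower()), {})
    # Sort by descending score, breaking ties alphabetically for stability.
    return sorted(options, key=lambda word: (-options[word], word))


def split_menu(options, visible=3):
    """Split the sorted list into the visible top entries and the remainder."""
    return options[:visible], options[visible:]


# Hypothetical vocabulary entry with made-up scores.
vocab = {"ma": {"mà": 0.40, "má": 0.20, "ma": 0.15, "mã": 0.10, "mả": 0.08, "mạ": 0.07}}

top, more = split_menu(suggest_word("má", "con má", vocab))
print(top)    # the 3 options shown first in the popup menu
print(more)   # options shown only on explicit request
```

Keeping the split between visible and hidden options outside suggest_word lets the same ranked list serve both the initial popup and the expanded list.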

FIG. 5 is a block diagram showing the steps to automatically diacritize text, according to the present invention. These are the steps implemented in the solver 104 in FIG. 1. First, in step 500 the input undiacritized text, a phrase, is tokenized into syllables. Then in step 501, for each syllable si in the input text, a search is performed for all words or phrases in the system vocabulary 106 whose undiacritized form matches si. Next, in step 502, a list of all possible phrases is generated, based on the list of matching words for each syllable si and the order of the syllables si in the input text. In step 503, the list of possible phrases is filtered and only grammatically valid phrases are kept. Note that each matching word is associated with a score and the score of a phrase may be calculated as the sum or product of the scores of its syllables. These candidate phrases are fed into an optimization solver in step 504 to calculate and return a phrase with maximal score.
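Steps 500 to 504 can be illustrated with the following Python sketch of an exact (exhaustive) solver. It is a simplified example under assumptions, not the disclosed solver: the function name `diacritize`, the per-syllable vocabulary layout, the sample scores and the caller-supplied `is_valid` predicate standing in for the grammatical filter of step 503 are all hypothetical. The phrase score is taken as the product of syllable scores, one of the two options mentioned above, and multi-syllable vocabulary entries are not handled.

```python
import itertools
import math


def diacritize(phrase, vocabulary, is_valid=lambda syllables: True):
    """Return the highest-scoring diacritized form of `phrase`."""
    syllables = phrase.lower().split()                       # step 500: tokenize
    # Step 501: look up variants; unknown syllables are kept unchanged.
    options = [vocabulary.get(s, {s: 1.0}) for s in syllables]
    best, best_score = None, -1.0
    # Step 502: enumerate every combination of variants, in syllable order.
    for candidate in itertools.product(*(opt.items() for opt in options)):
        words = [word for word, _ in candidate]
        if not is_valid(words):                              # step 503: filter
            continue
        score = math.prod(s for _, s in candidate)           # phrase score
        if score > best_score:                               # step 504: keep maximum
            best, best_score = words, score
    return " ".join(best) if best else phrase


# Hypothetical vocabulary with made-up scores.
vocab = {
    "toi": {"tôi": 0.75, "tối": 0.20, "tỏi": 0.05},
    "di": {"đi": 0.90, "dì": 0.10},
    "hoc": {"học": 1.0},
}

print(diacritize("toi di hoc", vocab))
```

Exhaustive enumeration grows exponentially with phrase length, which is one reason a configurable approximate solver, as mentioned in the description of FIG. 1, may be preferable for long phrases.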

While specific embodiments of the disclosed and claimed concept have been described in detail, it will be appreciated by those skilled in the art that various modifications and alternatives to those details could be developed in light of the overall teachings of the disclosure. Accordingly, the particular arrangements disclosed are meant to be illustrative only and not limiting as to the scope of the disclosed and claimed concept, which is to be given the full breadth of the claims appended and any and all equivalents thereof.