Title:
LANGUAGE IDENTIFICATION IN MULTILINGUAL TEXT
Document Type and Number:
WIPO Patent Application WO/2012/050743
Kind Code:
A3
Abstract:
Methods, systems, and media are provided for identifying languages in multilingual text. A document is decoded into a universal representative coding for easier tag manipulation, then broken into plain-text content sections. The sections are identified and assigned a weight, with more informative sections given a higher weight and less informative sections a lower weight. A language likelihood score is determined for each word, phrase, or character n-gram in a section. The language likelihood scores within a section are combined for each language. The combined section scores are then summed to obtain a total document score for each language, and the languages are ranked by these scores to determine the primary language of the document.
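
The scoring flow described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation: the likelihood table, the section kinds and weights, and the function names are all hypothetical, the decoding and tag-stripping steps are assumed to have already produced plain-text sections, and it assumes that scores are combined by simple addition and that each section's weight multiplies its combined score before the document-level sum.

```python
from collections import defaultdict

# Hypothetical toy word likelihoods per language; a real system would learn
# word, phrase, or character n-gram models from large corpora.
WORD_LIKELIHOOD = {
    "en": {"the": 0.9, "language": 0.8, "of": 0.7, "text": 0.6},
    "fr": {"le": 0.9, "langue": 0.8, "de": 0.7, "texte": 0.6},
}

# Hypothetical section weights: more informative sections count for more.
SECTION_WEIGHT = {"title": 3.0, "body": 1.0, "footer": 0.25}


def section_scores(text):
    """Combine per-word likelihood scores within one section, per language."""
    scores = defaultdict(float)
    for word in text.lower().split():
        for lang, table in WORD_LIKELIHOOD.items():
            # Unknown words contribute nothing (assumption: additive scores).
            scores[lang] += table.get(word, 0.0)
    return scores


def rank_languages(sections):
    """Rank languages for a document given (section_kind, plain_text) pairs."""
    totals = defaultdict(float)
    for kind, text in sections:
        weight = SECTION_WEIGHT.get(kind, 1.0)
        for lang, score in section_scores(text).items():
            # Assumption: the section weight scales the combined section score.
            totals[lang] += weight * score
    # Highest total first; the top entry is taken as the primary language.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


document = [
    ("title", "the language of the text"),
    ("body", "le texte"),
]
ranking = rank_languages(document)
primary_language = ranking[0][0]
```

In this toy document the English title outweighs the short French body because the title carries the larger weight, so English ranks first while French still receives a nonzero document score.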

Inventors:
LI KANG (US)
KLODER STEPHEN ALLEN (US)
JOHNSON IAN GEORGE (US)
ALONICHAU SIARHEI (US)
Application Number:
PCT/US2011/052133
Publication Date:
June 21, 2012
Filing Date:
September 19, 2011
Assignee:
MICROSOFT CORP (US)
International Classes:
G06F17/21; G06F9/44; G06F17/28; G06F40/00
Foreign References:
US20080281577A1 2008-11-13
US20100138211A1 2010-06-03
US20090198487A1 2009-08-06
US20090182547A1 2009-07-16
Other References:
See also references of EP 2628095A4