Login| Sign Up| Help| Contact|

Patent Searching and Data


Title:
METHOD AND A SYSTEM FOR PERFORMING AN OBLIVIOUS QUERY ISSUED BY A FIRST PARTY ON A STRING PROVIDED BY A SECOND PARTY
Document Type and Number:
WIPO Patent Application WO/2008/135951
Kind Code:
A1
Abstract:
This invention relates to a method for performing an oblivious query issued by a first party on a string xB provided by a second party. A query is provided in the form of a finite automaton M by a first party. An interactive protocol is run between the first and the second party to determine whether the string xB is accepted by the automaton M, without exchanging with each other the string xB and the finite automaton M.

Inventors:
KATZENBEISSER STEFAN (NL)
TRONCOSO-PASTORIZA JUAN R (ES)
CELIK MEHMET U (NL)
Application Number:
PCT/IB2008/051771
Publication Date:
November 13, 2008
Filing Date:
May 07, 2008
Export Citation:
Click for automatic bibliography generation   Help
Assignee:
KONINKL PHILIPS ELECTRONICS NV (NL)
KATZENBEISSER STEFAN (NL)
TRONCOSO-PASTORIZA JUAN R (ES)
CELIK MEHMET U (NL)
International Classes:
G06F21/00; G06F21/62; G16B30/10; H04L9/00
Other References:
FLORIAN KERSCHBAUM: "Practical Private Regular Expression Matching", IFIP. INTERNATIONAL FEDERATION FOR INFORMATION PROCESSING, XX, XX, vol. 201, 1 January 2006 (2006-01-01), pages 461 - 470, XP008096239, ISSN: 1571-5736
PARIS SMARAGDIS ET AL: "A Framework for Secure Speech Recognition", IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE SERVICE CENTER, NEW YORK, NY, US, vol. 15, no. 4, 1 May 2007 (2007-05-01), pages 1404 - 1413, XP011177218, ISSN: 1558-7916
ATALLAH M J ET AL: "Secure and Private Sequence Comparisons", PROCEEDINGS OF THE 2003 ACM WORKSHOP ON PRIVACY IN THE ELECTRONIC SOCIETY. WPES'03. WASHINGTON, DC, OCT. 30, 2003; [PROCEEDINGS OF THE ACM WORKSHOP ON PRIVACY IN THE ELECTRONIC SOCIETY], NEW YORK, NY : ACM, US, 30 October 2003 (2003-10-30), pages 39 - 44, XP002435200, ISBN: 978-1-58113-776-7
BETEL DORON ET AL: "Kangaroo - A pattern-matching program for biological sequences", BMC BIOINFORMATICS, BIOMED CENTRAL, LONDON, GB, vol. 3, no. 1, 31 July 2002 (2002-07-31), pages 1 - 3, XP021013549, ISSN: 1471-2105
TRONCOSO-PASTORIZA J R ET AL: "Privacy preserving error resilient dna searching through oblivious automata", PROCEEDINGS OF THE 14TH ACM CONFERENCE ON COMPUTER AND COMMUNICATIONS SECURITY, 29 October 2007 (2007-10-29) - 2 November 2007 (2007-11-02), Alexandria, Virginia, USA, pages 519 - 528, XP002495762, Retrieved from the Internet [retrieved on 20080910]
Attorney, Agent or Firm:
UITTENBOGAARD, Frank et al. (AE Eindhoven, NL)
Download PDF:
Claims:

CLAIMS:

1. A method for performing an oblivious query issued by a first party on a string X B provided by a second party, comprising: providing a query (301 ) in the form of a finite automaton M by the first party, wherein the first and the second parties engage interactively (303) without exchanging with each other the string X B and the finite automaton M and where the second party uses the string X B as an input for determining whether the string X B is accepted by the automaton M (305).

2. A method according to claim 1, wherein the second party repeatedly provides an encrypted vector of binary values to the first party, the first and the second party repeatedly performing an oblivious transfer protocol.

3. A method according to claim 1, wherein automaton represents a regular expression r.

4. A method according to claim 1, wherein the automaton M represents a string X A and wherein determining whether the string X B is accepted by the automaton M comprises applying approximate searching for determining whether X A is approximately present within the string X B .

5. A method according to claim 4, wherein the approximate searching further includes determining whether the number of edit errors that the string X A must undergo in order to match string X B is below a pre-determined threshold, wherein in case the number of edit errors is within the pre-determined threshold a match is declared between the strings X A and XB.

6. A method according to claim 4 or 5, wherein the string X A is shorter or equal in length as string x B .

7. A method according to claim 1, wherein the string X B represents: Desoxyribo-Nucleic Acid (DNA) data,

RNA data, sequence of amino-acids, or - protein sequences.

8. A method according to claim 7, wherein the query is a request to check whether the string X B codes a pre-disposition to a particular disease, the automaton M being indicative for said disease.

9. A method according to claim 3, wherein the regular expression r is used in a password format validation or data parsing.

10. A computer server (403) comprising a processor (404) for processing the method according to claim 1.

11. A computer client (410) comprising: input means (411) for enabling a user (420) to invoke the method according to claim 1; - receiving means (413) for receiving a result of invoking the method according to claim 1; a display (414) for displaying the result; and a processor (512) for processing the invocation, the received result and the displaying.

12. A system (400) comprising searching means (419) for performing an oblivious query on a string X B (401) provided by a second party (420), where the searching means (419) is adapted to determine whether the string X B (401) is accepted by a finite automaton M provided by a first party (403), wherein the first (403) and the second (420) parties engage interactively without exchanging the string X B and the finite automaton M with each other.

13. A system according to claim 12, wherein the searching means (419) is adapted to perform an error-resilient privacy-preserving string searching within a DNA sequence of a person.

14. A computer program product to be loaded by a computer arrangement, the computer arrangement comprising processing unit and a memory, the computer program product, after being loaded, providing said processing unit with the capability to carry out the method of claim 1.

Description:

Method and a system for performing an oblivious query issued by a first party on a string provided by a second party

FIELD OF THE INVENTION

The present invention relates to a method and a system for performing an oblivious query issued by a first party on a string X B provided by a second party.

BACKGROUND OF THE INVENTION

The human genome contains a wealth of information about a person's body; broad access to the genome is likely to revolutionize medical diagnosis and treatment. Doctors can, for example, use genomic information to test whether a person has a predisposition towards developing a specific disease, even years before the first symptoms appear. In treatment, genomic data may be used to predict whether a patient will react positively against a specific therapy or whether the treatment will likely fail, thereby reducing the overall costs and increasing the effectiveness of the therapy. Finally, it may be possible to create an individualized drug therapy for each patient by analyzing his genetic profile and predicting his response to different medications. Broad access to and storage of personal Desoxyribo-Nucleic Acid (DNA) sequences involves significant risks to personal privacy and may open the door for discrimination based on genomics. For instance, a person carrying a gene known to increase the likelihood of a particular cancer may be denied coverage by the health insurance company; an employee may be rejected for a permanent work contract due to his pre- disposition towards a disabilitating disease; or the discovery of parental relationships via DNA profiling may have undesirable consequences for the person's private life. These are only some of the risks we can foresee at this time; once the functionality of the human genome is fully uncovered there may be even more significant risks to privacy. As we move forward, there is a clear and emerging need for privacy-preserving mechanisms for the protection of genomic data.

Privacy concerns about DNA information have traditionally been addressed through laws and procedures: Healthcare professionals are required to keep sensitive data confidential and make it available only with explicit consent of the patient. So far this

traditional approach has worked reasonably well, mostly due to the limited availability and use of genomic profiles in established medical centers. Nonetheless, as genomic profiles become ubiquitous, this traditional form of protection may be insufficient to prevent sensitive information leakage. It is believed that cryptographic privacy-preserving protocols will become invaluable components that complement the procedural approach.

The following problem setup may be considered: A patient has a digital record of her DNA sequence and wants to give another party (such as her healthcare provider) selective access to run a query on this record, for instance, to find out whether she has a predisposition to a particular disease. As she is concerned about her privacy, she does not want to disclose her DNA profile to the health-care provider. On the other hand, her health-care provider may like to keep the details of the query confidential as it is commercially valuable.

BRIEF DESCRIPTION OF THE INVENTION

The object of the present invention is to overcome the above mentioned drawbacks by providing a method and a system for error-resilient privacy-preserving string searching suitable for e.g. DNA queries.

According to one aspect, the present invention relates to a method for performing an oblivious query issued by a first party on a string X B provided by a second party, comprising: - providing a query in the form of a finite automaton M by the first party, wherein the first and the second parties engage interactively without exchanging with each other the string X B and the finite automaton M and where the second party uses the string X B as an input for determining whether the string X B is accepted by the automaton M. Thus, a method is provided for the oblivious execution of a finite state automaton. More precisely, when two parties, one holding a description of automaton M and the other holding a string X B , execute the method/protocol, the result indicates whether M accepts X B without revealing M or X B to the other party.

The reason for using an automaton instead of a dynamic programming algorithm resides in the fact that an automaton has pre-defined transitions, and it does not need any comparison while traversing the input string. Comparison is one of the most expensive operations on encrypted data. By using a finite automaton, all the comparisons can be avoided, because they are "hard wired" in the automaton itself. Furthermore, using an

automaton allows the implementation of any problem represented in the form of a regular expression, endowing the privacy-preserving solution with a strong generality.

The term oblivious is well known in computer science and denotes a privacy preserving criteria, namely that one party does not want to exchange data with another party, and vice versa. In this case, party B possessing X B does not want to exchange his data with party A (e.g. a hospital server) possessing the automaton M, and vice versa, party A does not want to exchange the automaton M with party B because e.g. the automaton M might be of commercial value. Both parties however want to run the finite automaton (or finite state machine, FSM) on the second party's input, in such a way that the first party will not get any information about the string X B and the second party can only get an answer saying whether the finite automaton M accepts the string X B , without knowledge of M. The only information the second party can get about the FSM is the number of states.

The term query may be interpreted as a "privacy-preserving checking of a regular expression" or "privacy-preserving execution of an automaton". In one embodiment, the second party repeatedly provides an encrypted vector of binary values to the first party, the first and the second party repeatedly performing an oblivious transfer protocol.

Thus, it is ensured that the query is protected.

In one embodiment, the automaton represents a regular expression r. Thus, the automaton is limited to a finite state machine without output (or only single output, i.e. the final state).

In one embodiment, the automaton M represents a string X A and wherein determining whether the string X B is accepted by the automaton M comprises applying approximate searching for determining whether X A is approximately present within the string XB.

This is of a particular advantage since the strings may undergo various errors, e.g. due to chemical sequencing process or mutations. Thus, these errors are taken into account when doing the approximate searching.

In one embodiment, the approximate searching further includes determining whether the number of edit errors that the string X A must undergo in order to match a substring of string X B is below a pre-determined threshold, wherein in case the number of edit errors is within the pre-determined threshold a match is declared between the strings X A and XB-

Admissible errors in an approximate search are the typical errors to which e.g. a DNA string is subjected by chemical sequencing process or mutations. They are also present in text typing, due to human or Optical Character Recognition errors when transcribing some text. In one embodiment, the string X A is shorter or equal in length as string X B .

In one embodiment, the string X B represents:

Desoxyribo-Nucleic Acid (DNA) data,

RNA data, sequence of amino-acids, or - protein sequences.

Thus, the implementation of automaton can be implemented on different types of genomic sequences. Accordingly, the regular expression r may be considered as a disease template.

In one embodiment, the query is a request to check whether the string X B codes a pre-disposition to a particular disease, the automaton M being indicative for said disease.

Thus, it is possible to check whether the second party can suffer from a particular disease in a secure manner without requiring that the second party discloses to the first party its input string Xβ. It is also ensured that the first party does not disclose the regular expression r to the second party as it is commercially valuable. In one embodiment, the regular expression r is used in a password format validation or data parsing.

It is namely so that a regular expression can indicate the format that a given text must conform to in order to be considered valid, and this is normally the first step of a validation process that protects the validator from entries that are out of domain and would likely cause errors. Whatever the validated information is, the need for privacy in the validator extends also to the need of privacy for the format checker.

According to another aspect, the present invention relates to a computer server comprising a processor for processing the above mentioned method steps.

According to still another aspect, the present invention relates to a computer client comprising: input means for enabling a user to invoke the said method; receiving means for receiving a result of invoking said method; a display for displaying the result; and

a processor for processing the invocation, the received result and the displaying.

According to yet another aspect, the present invention relates to a system comprising searching means for performing an oblivious query on a string X B provided by a second party, where the searching means is adapted to determine whether the string X B is accepted by a finite automaton M provided by a first party, wherein the first and the second parties engage interactively without exchanging the string X B and the finite automaton M with each other.

In one embodiment, the searching means is adapted to perform an error- resilient privacy-preserving string searching within a DNA sequence of a person.

According to yet another aspect, the present invention relates to a computer program product to be loaded by a computer arrangement, the computer arrangement comprising processing unit and a memory, the computer program product, after being loaded, providing said processing unit with the capability to carry out the above mentioned method steps.

The aspects of the present invention may each be combined with any of the other aspects. These and other aspects of the invention will be apparent from and elucidated with reference to the embodiments described hereinafter.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will be described, by way of example only, with reference to the drawings, in which

Figure 1 shows a plot of number of states of Levenshtein Automaton and its extension as a function of the sequence length, Figure 2 shows a state diagram for the Levenshtein automaton accepting all the sequences of distance < 1 of [actg] (a) and its extension to arbitrary length sequences (b),

Figure 3 shows a flowchart of a method according to the present invention, and

Figure 4 shows a system 400 according to the present invention.

DESCRIPTION OF EMBODIMENTS

Figure 3 shows a flowchart of a method according to the present invention for performing an oblivious query issued by a first party on a string X B provided by a second party.

In step (Sl) 301 a query is provided in the form of a finite automaton M by the first party and kept secret by the first party. As an example, in case of DNA/RNA searching, the automaton may be built directly from the query "is string X A approximately present in string X B T\ This will be discussed in more detail later. The finite automaton M may represent a regular expression r or a string x λ .

In step (S2) 303, the first and the second parties engage interactively without exchanging with each other the string X B and the finite automaton M, where the second party uses the string X B as an input for determining whether the string X B is accepted by the automaton M or not (S3) 305. By the term "engage interactively" is simply meant that an interactive protocol is run between the first and the second party to determine whether the string X B is accepted by the automaton M. A successful run of the automaton determines whether X A is approximately present within the string X B .

In one embodiment, the second party repeatedly provides an encrypted vector of binary values to the first party, where the first and the second party repeatedly perform an oblivious transfer protocol.

In one embodiment, the automaton M represents a string X A and wherein determining whether the string X B is accepted by the automaton M comprises applying approximate searching for determining whether X A is approximately present within the string X B The approximate searching may include determining whether the number of edit errors that the string X A must undergo in order to match a substring of X B is below a pre-determined threshold. In case the number of edit errors is within the pre-determined threshold a match is declared between the strings X A and X B . AS will be discussed later on, if two parties involved in the protocol possess one of the two strings x and y, the approximate searching includes determining whether some substring of y can be obtained from x by applying a constant number of symbol errors, deletions and insertions, without disclosing their strings to each other. The string X A may be shorter or equal in length as string X B . These strings may be Desoxyribo-Nucleic Acid (DNA) data, RNA data, sequence of amino-acids, or protein sequences.

The query may thus be a request to check whether the string X B codes a pre- disposition to a particular disease, where the automaton M is indicative for said disease.

Referring to the above mentioned embodiment, the present invention may be implemented to check if a short string (e.g. a disease template), known to one party, is present inside another longer string (e.g. a DNA sequence of a person), owned by another party, accounting for possible errors and without disclosing to each party the other party's input.

Figure 4 shows a system 400 according to the present invention comprising searching means (S M) 419 for performing an oblivious query on a string X B 401 provided by a second party 420, where the searching means (S_M) 419 is adapted to determine whether the string X B 401 is accepted by a finite automaton M provided by a first party 403, which may e.g. be a hospital server. The first and the second parties engage interactively without exchanging the string X B and the finite automaton M with each other.

As depicted in this embodiment, the system is divided into a computer client 410 and a computer server 403 where the server 403 comprises a processor (404) for processing the method steps in Fig. 3. The computer client 410 comprises an input means (I_M) 411 for receiving said string X B 410 from a user 420 as an input (e.g. a DNA string).

The user 420 may then access said computer server 403 at a first party side, e.g. which can be a hospital server, via a communication network 402 such as the Internet. The processor (P) 404 or the searching means (S M) 419 comprised at the computer server 403 uses the string X B as an input for determining whether the string X B 410 is accepted by the automaton M. The computer client comprises a receiving means (R M) 413 for receiving the result of invoking said method steps, a display (D) 414 for displaying the result, and a processor (P) 412 for processing that operates the input means (I M) 411, the receiving means (R_M) 413 and the display (D) 414.

In the following, the above mentioned method steps and various embodiment will be discussed in more detail.

1.1 Queries on DNA data

A DNA sequence may be denoted as a finite string over the alphabet σ = {A, C,T, G}, representing the four different nucleotides Adenine, Cytosine, Thymine and Guanine (also known as bases). How this sequence regulates human physiology is under investigation. However, one of the main regulation mechanisms is through encoding of proteins. Triplets of nucleotides in particular sections of a DNA sequence, known as coding regions, encode different amino-acids. In turn, a sequence of amino-acids forms a protein, which regulates various functions in the body. The following are different properties that need to be considered in DNA queries that have clinical implications.

• Mutations: A mutation is a deviation on the DNA sequence that may affect one single nucleotide or a sequence of subsequent nucleotides. It may involve substitutions (one nucleotide is converted into another), deletions and insertions (missing or extra nucleotides due to imperfections in the replication process). Specific mutations in the coding

regions are known to be indicative of some diseases. The location of these mutations can be fixed and known in advance or occur at a relative distance from a fixed marker. Often the query for the presence of a mutation is to check whether a certain string xe ∑ * appears in the DNA sequence. A mutation may also appear in a non-coding region of a DNA sequence or it might be clinically irrelevant, in that case the mutation becomes an error that the query should handle gracefully.

• Sequencing Errors: Today, even the best DNA sequencing methods cannot guarantee 100% accuracy. Due to the imperfections of the chemical sequencing process, three different types of errors occur: symbol errors (where an incorrect base is recorded), insertions (where a base is reported in the digital record that is not present in the genome) and deletions (where the sequencing process fails to report a base, even though it is present in the analyzed genome). Queries on DNA sequences should thus be able to cope with infrequent Edit errors. These errors are the same errors that can occur when transcribing a text or when using an Optical Character Recognition (OCR) tool, and they are widely known as Edit errors; there is a known metric (Levenshtein or Edit distance [14]) to quantify the minimum number errors of this nature that a sequence has suffered.

• Many-to-one Mappings: While each triplet of DNA bases encodes one amino- acid, the encoding is not unique: there exist different base triplets that encode the same amino-acid. As only the latter is relevant in diagnosis, DNA queries should handle all possible mappings.

• Incomplete Specifications: In the existing medical genomic databases of DNA sequences, there are many elements that are not completely specified, as the exact effect of punctual mutations has not yet been completely determined, but there is evidence of the relationship between these mutations and a known disease. In this case, the possible queries that can be applied in order to detect those mutations must offer flexibility, apart from the already mentioned error resilience.

A natural representation for a query that is able to cope with the aforementioned properties specific to DNA searches is to use regular expressions. These expressions can be implemented as finite-state machines (finite automata). It should be noted that this representation not only handles incomplete specifications and different mappings, it can also be used to cope with Edit errors due to sequencing problems or clinically irrelevant mutations (cf. Section 5.1).

1.2 Contributions

The technical contributions of this disclosure can be summarized as follows:

• An efficient (amortized linear time) protocol for the oblivious execution of a finite state automaton is presented. More precisely, when two parties, one holding a description of automaton M and the other holding a string x, execute the protocol, the result indicates whether M accepts x without revealing M or x to the other party.

• Also, it is shown how this protocol can be used to solve the problem of oblivious approximate string matching. In this case, each of the two parties involved in the protocol possesses one of the two strings x and y. They collectively determine whether x can be obtained from y by applying a constant number of symbol errors, deletions and insertions, without disclosing their strings to each other. The proposed solution translates one of the strings into a finite-state machine and executes it obliviously.

• Also, it is shown that the solution can be generalized to the approximate sub- string search problem, where the parties want to determine whether x is an 'approximate' substring of y, allowing again a maximal number of symbol errors, insertions and deletions. This problem has not been addressed so far in a privacy-preserving setting.

• The protocol is extended to automata with non-binary output, which are used in applications such as text parsing, computational linguistics and speech recognition. The extended protocol allows for privacy-preserving private execution in these applications.

The rest of the description is organized as follows. Sections 2 and 3 briefly describe some basic concepts needed in the rest of the paper and survey related work. Section 4 shows the proposed solution for privacy preserving execution of any finite automaton, along with a complexity evaluation and a security analysis. Section 5 explains the use of automata for the approximate matching and searching problem. Finally, Section 6 summarizes other applications of the developed protocol.

2. PRELIMINARIES

In this section, the notions used here below are introduced. Boldface lowercase letters will be used or vectors, and uppercase boldface letters for matrices; } denotes the matrix or vector transpose. When not specified, vectors are column vectors. An element at row i and column y of a matrix M will be denoted by M(i, j) . E[x] and D[x] stand for the encryption and decryption of message x with a homomorphic semantically secure cryptosystem [9, 21]. When the parameter is a vector, it will denote the component-wise

encryption and decryption of that vector. The expression αe s A denotes the random choice of a value a from the set A with uniform distribution. Finally, a* will denote the Kleene Closure of a.

2.1 Approximate Matching and Searching

The problem of DNA matching is a particular problem of approximate string matching, in which the used metric is the Edit or Levenshtein distance [14]. This distance accounts for three types of errors, namely substitutions, deletions and insertions. Given two sequences, the matching problem (a.k.a. sequence alignment) consists in checking if the number of edit errors that one sequence must undergo in order to exactly match (with no errors) the other is below a given threshold; in such case, both sequences are declared to match. The three considered errors are typically errors produced by DNA mutations during replication of sequences, as well as the errors produced when sequencing fragments of DNA. In the field of Bio informatics, sequences alignment is used when comparing DNA sequences, in order to determine relations between different genes, or to find some similarities between a genetic disease and known genes or proteins, in order to identify the causes of the symptoms of that disease.

The algorithm that computes sequence alignments is a dynamic programming algorithm, developed by Needleman and Wunsch [20], but similar algorithms are also used for speech recognition [23] and spell checking, as Edit errors are also the typical errors committed when typewriting or when scanning a text and processing it through an OCR (Optical Character Recognition) system. This algorithm gives also the minimum distance path between both sequences. Nevertheless, in most applications, it is not necessary to know the exact path, but only if the distance is below a given threshold (the binary decision problem). This, as pointed out in section 5.1, can be solved by using finite automata.

On the other hand, Approximate string searching applies when one short sequence (the pattern) is looked for inside a longer sequence. If an approximate match (using the Levenshtein distance as metric) is found between the pattern and some substring of the long sequence, the search will report a positive answer. The previous problem of approximate string matching can be seen as a special case of searching when the length of both sequences is comparable. Searching is employed in Bio informatics for looking for a known DNA pattern (which is, for instance, known to correspond to a genetic disease) inside some long DNA sequence of a person's gene. The same problem also arises in other applications like word/pattern finding in a document (spam checkers or virus analyzers).

2.2 Finite Automata

A deterministic finite automaton [11] (or finite state machine, FSM) is denoted by a 5-tuple M = where Q is a finite set of states, σ is a finite input alphabet, q 0 € Q is the initial state, F c: Q is the set of final states, and δ is a transition function.

Without loss of generality, one may restrict himself to 'complete' finite automata, where it is possible to make a transition at each state with every input symbol (each FSM can be transformed into an equivalent complete automaton by adding a sink state). Representing the states as integers in Z | QJ, the inputs as integers in Z\ ∑ \, and the transition function as a matrix δe C^ ∑|x|g i (Z ρ ) > such that A(i,j) represents the next state when the FSM sees an input z e σ and is in current state j ' e Q .

A string x = xoxi. . . XN-I e ∑ N is said to be accepted by the finite automaton M = if the state q N = A ( ■ ■ ■ A (A (qo, X 0 ), xi) ■ ■ ■ , XN-I) is a final state q N G F .

The language accepted by a finite automaton M is the subset of all strings from σ * it accepts. It is known that the sets accepted by FSMs and regular sets coincide. Thus, for every regular expression there is a finite automaton that accepts only words that match that expression, and vice versa. Finite automata can only express decision problems, corresponding to whether a string is accepted or not; thus, binary finite automata are also called acceptors. The theory of automata has been extended to finite state machines that are capable of producing a string over a finite alphabet FI as output. Automata with non-binary output are called transducers, and they can be classified into two groups: • Moore machines: At each transition, the automaton produces one symbol as output, where the output symbol only depends on the current state of the machine. Formally, a Moore machine is a 6-tuple (Q, σ, FI, A,λ,q 0 ), where Q, σ , δ , q 0 have the same meaning as for FSMs. FI denotes the output alphabet, and λ e FT 2 ' is a vector whose components λ(q) encode the output symbol of the machine at a given state q. • Mealy machines: At each transition, the automaton produces one output symbol, which can depend on the transition taken. Formally, a Mealy machine is a 6-tuple (Q,σ,U,A,λ,q 0 ), where Q, σ , δ , q 0 have the same meaning as for FSMs. FI denotes the

output alphabet, and λe cW , Q : (U) is a matrix whose components λ(a, q) encode the output of the machine for a given state q and input symbol a.

In general, Moore machines are as expressive as Mealy machines; however, a Mealy machine may need a smaller number of states than its equivalent Moore automaton.

3. RELATEDWORK

Simple privacy-preserving problems can be solved through the application of generic Secure Multiparty Computation protocols [24, 6]. Nevertheless, most of the generic solutions are not practical. Thus, specific protocols must be developed for efficiently dealing with each privacy-demanding application. The problem posed here below is one exemplary instance where generic solutions yield to particularly inefficient protocols; this is mainly due to the need for error resilience in the search process.

Secure Function Evaluation is a special case of Secure Multiparty Computation [8, 7] in which a set of players want to evaluate a function, known to all players, on their private inputs. Both concepts were introduced by Yao [24]. Subsequently, various approaches to securely evaluating a function have been developed for different function representations, namely combinatorial circuits [8, 24, 12], ordered binary decision diagrams [13], branching programs [18, 17], or one dimensional look-up tables [17]. Each of these approaches can achieve a practical and efficient oblivious protocol for evaluating a given function/, if/ can be expressed in a space-efficient manner in the chosen representation.

The problem of secure function evaluation is symmetric, in the sense that all parties agree on the function to be evaluated, and all parties hold a subset of the input to that function. The only concern is to keep the inputs of each participant private, whereas the function is assumed to be common knowledge. In contrast, the problem considered in the present application is highly asymmetric, in the sense that only one party (the server) knows the function that is evaluated, whereas the other party (the client) holds the corresponding input. From a higher level perspective, both parties agree on some specific functionality that the server implements, and whose inputs are given by the client, who does not know the implementation of that functionality.

To the best of the inventors knowledge, only the work in [2, 3] give an efficient solution for a problem akin to privacy preserving approximate string searching. In that work, the authors present a protocol fox privacy preserving Edit distance evaluation. The Edit distance (or Levenstein distance [14]) computes the minimum number or errors

(substitutions, deletions, and insertions) that one given string must suffer in order to match exactly another given string. The calculation of the Edit distance is performed through a dynamic programming algorithm [4] (named Needleman-Wunch [20] algorithm inside the genomics scientific community) that achieves linear time complexity in the product of the lengths of the aligned sequences. The authors implement an oblivious version of the dynamic programming algorithm that achieves the same order of complexity. If a threshold in the number of admissible errors is established, their protocol can be regarded as a solution to a particular instance of the problem of approximate string matching (cf. Section 2.1). Even though the protocol can be extended to solve the approximate string matching as well, where the target is to discover whether a short string is present within a longer one (within a bounded number of edit errors), the number of comparisons involved in the dynamic programming algorithm grows with the product of the length of the strings, yielding to a practically inefficient solution. The solution proposed here completely avoids comparisons of encrypted values, thus overcoming this scalability problem. Furthermore, the solution disclosed here is more general, as it is not limited to approximate matching or searching, but can be applied to any regular expression matching problem in sequences formed by symbols of a finite alphabet.

For solving the posed problem, similar techniques are used to those of secure function evaluation, among them secret sharing schemes [5], homomorphic encryption and oblivious transfer [19].

4. SECURE EXECUTION OF FINITE AUTOMATA

The problem of obliviously running an automaton can be informally stated as an asymmetric function evaluation problem, in which one party possesses a function/ and the other party owns the input x to that function. One or both of those parties want to obtain the evaluation off(x), but neither party wants to disclose his own data. Here, the function/is implemented as a finite state machine (FSM), and the input x is a Boolean value encoding whether x was accepted by the FSM.

More formally, let be a deterministic FSM, whose description is owned by party A. Let x = X 0 X 1 . . . x N -ι e σ^ be an input to that FSM; x is owned by party B. Both parties want to run the FSM on B's input, in such a way that A will not get any information about the input string x except its length, and the only information B can get about the FSM is its number of states |Q|.

Before presenting the solution of the present invention, results of applying generic solutions will be briefly commented, to the posed problem, in order to clarify the advantages of the proposed protocol.

4.1. Generic solution

The adaptation of protocols for secure function evaluation like Generalized Indirect Indexing [ 18, 17] or Mix and Match [ 12] to this framework poses several problems, as these primitives cannot index two-dimensional matrices (like the state-transition matrix) in an communication efficient way. On the contrary, they are designed for indexing one- dimensional arrays, so the state-transition matrix must be flattened prior to its manipulation; at every step an amount of data equivalent to the whole matrix must be transferred between both parties. This results in a communication complexity of O(N \Q\ |∑|) .

4.2 Proposed Solution A novel solution is proposed for the posed problem, that achieves a significant complexity reduction, as will be shown in section 4.3. For the oblivious run of an automaton in the presented scenario, both parties engage in an interactive protocol, whose number of rounds is linear in the length of the input string x. In particular, the protocol is composed of three sub-protocols, one for performing the first state transition, one for performing an arbitrary transition of the automaton, and one for announcing the result.

The first subprotocol performs the first state transition of the automaton, starting from its initial state, and reading the first input symbol of x. The subprotocol distributes shares of the following state to both A and B. Subsequently, for each further state transition, the second subprotocol is executed. Starting from the shares of the current state, it jointly calculates the transition to the next state in an oblivious way. In order to obtain the output of the transition function without disclosing its inputs (B' s string current symbol and the current state) nor the contents of the transition matrix, homomorphic encryptions and additive shares are used. At the end of the sub-protocol, shares of the subsequent state are distributed to A and B. After all state transitions have been performed (i.e., all symbols of x have been consumed), the last subprotocol is used to determine whether the computation of the automaton ended in a final state.

In the following it is assumed that the encryption system is set up such that B holds the decryption key; however, it can also be implemented with a (fair) threshold encryption scheme requiring a joint decryption step. Furthermore, the output of the protocol

is only revealed to B; again, it could be implemented so that the result is revealed to both parties.

Subprotocol: First State Transition This subprotocol performs the first state transition of the automaton.

1. A generates a random r a (1) G R Z, Q , ; then, he selects the column qo of δ as vector and blinds every element with r a (1) : v< 0) = A{i,q o ) + r^ mod \Q\, i = 0, . . . , | ∑ | - 1.

2. Both parties engage in an OT j ,being A the sender and B the chooser, in which B gets the element with index xo of V 0 K This element corresponds to q W + r ω m od |ρ|.

At the end of this subprotocol, both parties share the next state q^ of the automaton.

Subprotocol: k-th State Transition

In this step, A starts having r (k) and B holds r (k) =q (k) + r (k) mod \Q\.

1. A generates a random r a (i+1) G R Z Q , and blinds every element of the matrix δ with it. At the same time, A rotates the rows of δ , r (k) positions to the left, obtaining the matrix δ ^ with elements δ ^(i, j+ r[ k) mod | Q\) = A (i, j)+ r (k+l) mod \Q\ .

2. B generates a binary vector e^ of length \Q\, consisting in all zeros and a one at position r {k) . B encrypts this vector Efe^J and sends the encryptions to A.

3. A performs the vector product v^= δ (i) . e® under encryption, making use of the homomorphic properties of the encryption operation, obtaining the σ -length encrypted vector Efv^J. This result corresponds to an encryption of the column at position r[ k) of δ (i) , or equivalent, the column at position q® of δ , the blinded transition vector for the current state.

4. Both parties engage in an OT j , being A the sender and B the chooser, in which B gets the element with index Xk of Efv^J. This element corresponds to the encryption of

q ^ ) + r ^ moά \Q\, that can be recovered by B through decryption.

At the end of this subprotocol, both parties share the next state q^ +1 K

Subprotocol: Announcement of Result

Once all the elements of x have been consumed by A's FSM, the last step determines whether the reached state is a final state. Again, A starts with r a {N) and B holds

1. A generates a random binary vector/ as f(j + r < a N) m °d \Q\) = D e FLy = 0, . . . , \Q\ - 1, whose Boolean elements encode whether a state y is a final state, having ones in the indices corresponding to acceptance states and zeros in those indices corresponding to non-acceptance states. This vector is shifted, so that the index r^ N) that B possesses represent the position of the acceptance of the actual final state. 2. Both parties engage in an OT f ' , being A the sender and B the chooser, in which B gets the element with r^ N) of/ This element gives the binary output of the FSM.

4.3 Complexity Evaluation

As 1-out-of-m oblivious transfer can be implemented with linear communication complexity, the communication complexity for each subprotocol is + 1∑|) . Furthermore, as one subprotocol needs to be performed for each symbol of the input string, the communication complexity for obliviously running a FSM on an input of length N is 0(N • (\Q\ + 1∑|)) . Thus, the complexity is linear in the number of states and in the size of the input alphabet, instead of linear in their product. This implies a great improvement in complexity with respect to generic approaches for big input alphabets and high number of states.

Regarding the computational complexity, the OT protocol of Naor and Pinkas

[19] is used for implementing OT 1 " , as this protocol has an amortized complexity oϊθ(m) products for the sender and 0(1) products for the chooser. With these magnitudes for the OT subblocks, it is easy to see that the total complexity for A is 0(N σ • |(?|) , being the matrix multiplication performed at each step the most costly operation. On the other hand, the amortized complexity for B is just 0(N- 1(?|) , being the encryption of the vector that

determines the shifted current state (e®) the most costly operation. It must be noted that the operations in the protocol can be transposed, in such a way that (e®) represent the current symbol, and the result of the matrix multiplication in step 3 of the k-th state transition subprotocol produces a vector of blinded next states for the current symbol. In this way, the roles of the dimensions \Q\ and | σ | are interchanged. Thus, the amortized complexity for B can be reduced to 0(N min( \Q\, |∑|)) .

4.4 Extension to Transducers

The basic protocol described above can be extended to transducers, while keeping the same order of complexity. This can be achieved by including some additional steps at each state transition, and omitting the last subprotocol (announcement of results).

Following the notation of Section 2.2, the modifications for implementing each type of transducers are described in the following:

• Moore machines: The output depends only on the current state, so A will have a vector λe (Z n )' β ' such that λ y holds the output for the statey.

For the initial step, the output is trivial (λ(qo)), and it can be sent to B. For the £-th state transition protocol, a modification must be made in its third step, in which besides the homomorphic calculation of δ (i) • e®, A also rotates λ λ ® G + r∞ πod \Q\) = λ, , computes the encryption of (λ (i) )' • e®, and sends the result (one encrypted scalar) to B.

• Mealy machines: In this case, the output depends on the state and the input, so A will have a matrix λe θCf ∑χ „ (Z n ) , such that A(i,j) gives the output for input i and state j-

In the first state transition, two OT 1 σ are run in parallel on vectors V^ and column qo of λ . This gives B the blinded first transition and the corresponding output.

For the k-th state transition protocol, its third step must be modified, such that besides the homomorphic calculation of δ (i) • e (i) , A also rotates λ + r ( a k) mod \Q\) = A(i,j) , and computes the encryption of w® = λ (i) • e (i) ; the following OT 1 σ protocol is run in parallel on both vectors V^ and W® .

4.5 Security Analysis

In the security framework established in appendix A, considering semi-honest parties

(neither party wants to deviate from the established protocol, but they can try to infer some information about the other party's input), one can state the following:

Claim 1. The proposed protocol privately evaluates A 's FSM on B 's input.

Proof. {Sketch) Firstly, it is straightforward to see that the output given by the protocol is the correct one, as if both parties follow the protocol, the transition function is correctly calculated at every step. With respect to the semi- honest model, it will be shown that the protocol is secure for both parties, according to Definition 1. As OT is a secure protocol, is may be assumed that simulators S c and S s , which produce views that are indistinguishable of those of the chooser and the sender respectively, exist. First a simulator for S A fork's view of the protocol will be constructed. Its input is (Q,σ,A,q o ,F,N) . For each step of the protocol, S A produces the random choices of A, a random encrypted vector e®, which is indistinguishable form a real e® due to the semantical security of the encryption, and uses S s to produce A 's view in the OT protocol.

For the case of B's view, its simulator uses S c to produce B's view of the OT protocols, and generates random encrypted vectors indistinguishable form v®, again thanks to the semantical security of the encryption. Finally, a binary random vector/is generated for which the proportion of ones has the same distribution as the probability of X B being accepted by a random \Q\ -states automaton. Thus, the final answer will also be indistinguishable from

the true one.

5. SECURE APPROXIMATE SEARCHING

In this section, it will be shown how the problems of privacy preserving approximate DNA searching and matching, as introduced in Section 2.1, can be solved using the protocol for oblivious automata execution.

5.1 Translating Searching into Finite Automata

Given a string X A , the method in [22] will be used for computing a finite automaton Lεv d (X A ) that accepts only the strings that are inside a closed ball of radius d around (X A ) using the Levenshtein distance as metric. The resulting minimal automaton is

denoted degree d Levenshtein automaton. By construction, Lεv d (X A ) is always acyclic. The language accepted by this automaton will be donated by L d (X A ). The algorithm for generating Lεv d (X A ) having X A is linear in time and space in the length of the string X A . In this way, the problem of calculating the Levenstein distance between two sequences X A and X B and comparing it to a given threshold d gets reduced to the execution of the computed automaton Lεv d (X A ) on input X B .

Once the Levenshtein automaton for a given sequence is generated, it may be extended so as to accept the language ∑ * L d (x)∑ * . This means that the resulting automaton accepts any string that contains as substring any of the sequences accepted by the Levenshtein automaton, thus solving the problem of approximate string searching.

The advantage of using an automaton instead of a dynamic programming algorithm resides in the fact that an automaton has predefined transitions, and it does not need any comparisons while traversing the input sequence. Comparisons are one of the most expensive operations under encryption (as they reduce to instances of the Millionaire's problem). By using a finite automaton, all the comparisons can be avoided, because they are all hard-wired in the automaton itself. Furthermore, using an automaton allows the implementation of any matching problem represented in the form of a regular expression, endowing our privacy-preserving solution with a strong generality.

Even though the construction of the Levenshtein automaton assures that the number of states of Lεv d (X A ) is linear in the length of the sequence X B , computing the extended automaton ∑ * L d (x) σ * will increase its number of states. Extending the Levenshtein automaton comprises two concatenations with σ * , the right one being trivial, as it only involves adding self-loops in all the final states. This right concatenation cannot increase the number of states of the automaton. Let us suppose that the Levenshtein automaton has n states, t of them being acceptance states; by construction, the automaton is acyclic. Applying the right concatenation, all of the t acceptance states collapse to only one sink acceptance state, and the rest of the states remain unaltered. Thus, the resulting automaton after the right concatenation has n - t + 1 states, one of them being the unique sink acceptance state, and the only cycles that the automaton has are the self loops in this state. Applying a known bound on the state complexity of the concatenation of regular languages [25], the left concatenation could increase the number of states of the automaton by at most 2" . Nevertheless, this bound is a worst case bound. The inventors

have found experimentally that the number of states usually grows linearly even after performing the left concatenation, resulting the number of states of the extended Levenshtein automaton being linear in the length of the input sequence X A . AS an example, Figure 1 shows the evolution of the number of states of the Levenshtein automaton Lεv d (X A ), as well as its extension to the language ∑ * L d (x)∑ * , as a function of the length of the sequence x, for threshold Levenshtein distances of 1 and 2 errors. The plot was obtained using 100 random DNA sequences x for each length and averaging the number of states of the obtained automata; it also shows the 95% confidence intervals. From this figure, it is clear that the state complexity usually is linear in the length of the input sequence. As a toy example, for the sequence x = [actg], the Levenshtein automaton for distance d = 1 is shown in Figure 2a, while the extension to cope with arbitrary length sequences is shown in Figure 2b. It is clear from this example that the extension to arbitrary length sequences does not necessarily imply an increase in the number of states of the automaton; in this case, it even supposes a reduction, due to the short length of the used pattern. Figure 2b also shows thicker arcs for the transitions that the automaton makes when inputting the sequence [ttcggcgctgga], where the pattern is present with one deletion, resulting in acceptance.

5.2 Secure Approximate Searching for DNA Sequences Referring to the scenario of DNA searching: Two parties A and B want to check if the DNA pattern X A (owned by A) is approximately present in B's DNA sequence X B , where |λ£|»|λU|. The case of matching is similar, except that |JC#|~|JC^|. The approximate presence means that the Levenstein distance between X A and some substring of X B is less than a given threshold d. To perform either matching or searching in a privacy-preserving manner, both parties execute the following steps:

1. A builds the Levenshtein automaton (Q, σ, δ, q 0 , F) corresponding to his sequence X A , given a maximum allowable distance d, following the procedure in [22], and minimizes it. If needed, the number of states can be partially concealed by adding a random number of dummy states.

2. In the case of a search, A extends the Levenshtein automaton by concatenating the Kleene closure of the alphabet σ * at the left and at the right (see Section 5.1). The

resulting automaton is then minimized. Again, a random number of dummy states can be added in order to partially conceal the number of states of the minimal automaton. 3. Both parties run the protocol presented in Section 4.2 with A 's automaton and i?'s sequence X B as inputs, in order to get a binary answer to the approximate matching or searching problem.

Regarding the complexity of the resulting protocol, we can combine the results obtained in Sections 4.3 and 5.1. By virtue of the former, the extended Levenshtein automaton usually has a state complexity linear O(n) in the length n of the input sequence X B ,' the latter shows that the private evaluation of an automaton with \Q\ states and an input alphabet σ on an input sequence of length N, has a communication complexity of

O(N (\Q\ + 1∑|)) . Finally, the application of the developed protocol for the approximate search of a sequence of length n in another sequence of length N incurs in a communication complexity of O(N n) . Concerning computational complexity, taking into account that the automaton transformation can be precomputed, and applying the same reasoning as for communication overhead, the total amortized computational complexity for the owner of the query n-length sequence X A is O(N n) , and for the owner of the long N- length sequence X B , it is 0(N). This means that for the party that makes the query, the privacy preserving protocol has a computational complexity in the same order as the one of the non-privacy preserving protocol, while the complexity for the other party is linear in her sequence's length, and does not depend on the length of the query string.

We can also make one final remark about round complexity. In Section 4.3 we have stated that the round complexity of the privacy-preserving protocol for the automaton evaluation is linear in the length of the input sequence (X B ). In this particular case, it is known that A 's Levenshtein automaton will accept only sequences of length smaller than \X A \ + d. Thus, if round complexity is a concern and the value \X A \ + d does not have to be kept secret, we can partition the input sequence X B into several consecutive blocks with an overlap of symbols, and run in parallel one instance of the oblivious automaton protocol per block. Then, a logical OR can be straightforwardly applied to the obtained (concealed) outputs. Taking the maximal number of blocks, the number of rounds of the resulting protocol does not depend on the length of the input sequence; as a counterpart, the overlaps produce an increase in communication complexity, which is quadratic in the number of states of the automaton. Between the two extreme cases, a tradeoff can be found, with a sublinear

round complexity in the length of the input sequence X B and a subquadratic communication complexity in the number of states of the automaton.

6. FURTHER APPLICATIONS As the protocol presented in Section 4 allows the efficient privacy-preserving execution of an automaton, it can be applied to any problem with a need of privacy preservation that can be stated in terms of a regular expression. There are plenty of applications where regular expressions are commonly used, like password format validation or data parsing. In general, a regular expression can indicate the format that a given text must conform to in order to be considered valid, and this is normally the first step of a validation process that protects the validator from entries that are out of domain and would likely cause errors. Whatever the validated information is, the need for privacy in the validator extends also to the need of privacy for the format checker.

Another typical application of regular expressions is file parsing, where some text is erased, substituted or inserted in some parts of the file; this can be done through the application of a finite automaton with output. When security is a concern the input text has to be protected, and the presented protocol may be applied. A specific case of the above is word or pattern finding in a document, a commonly used technique in spam checkers for electronic mail or virus analyzers. When dealing with confidential mails or private software, they must be protected from the party that runs the checker or analyzer. The application of the protocol for these scenarios is straightforward.

On the other hand, sequential transducers represent an efficient approach for large-scale dictionaries [15, 16], used for computational linguistics, in lexical analysis, morphology and phonology, syntax, text-to-speech synthesis, or speech recognition. All these applications can also be handled by the protocol presented in this work when there is the need of protecting the recognized sequence.

7. CONCLUSIONS

A protocol has been presented for the secure evaluation of finite state machines that may be used for privately solving any application that can be stated as a regular expression. A security framework has been established for the corresponding asymmetric secure function evaluation problem, and it has been proven that the protocol is secure under this framework. The protocol has also been shown to be efficient in terms of communication complexity, being this linear in the size of the input alphabet and in the

number of states of the FSM. This is a great improvement with respect to generic approaches, which can only achieve a communication complexity linear in the product of both magnitudes.

As the main application field, it has been shown how to use the developed protocol for secure DNA matching, with a similar complexity to previous privacy-preserving approaches to this problem, and overcoming several problems derived from the use of finite fields that these approaches had. Furthermore, the first efficient privacy preserving solution has been presented for error-resilient DNA searching.

Finally, due to the versatility of finite state machines, the presented protocol can also be used for privately solving any problem that involves matching a string against a regular expression, such as searching a DNA database with incomplete definitions, oblivious spam checkers and virus analyzers, and even private speech synthesis/analysis.

Appendix A. Security framework for oblivious automata evaluation

In this section a framework is defined in which security for both parties of the protocol is evaluated. Due to the asymmetry of the problem in terms of inputs to the protocol, its framework is slightly different than the commonly used for general two-party computation. Nevertheless, the problem can be restated into a two-party computation framework in the following way.

Let G be a functionality that, given the description of a function as a FSM /(■) ≡ (Q,σ,A,q o ,F) and its input x, gives as output G[ Then, the problem for the asymmetric function evaluation may be stated as a two-party computation problem in which party A holds the input /(•), and party B holds the input x; both parties want to evaluate G on their inputs.

Let us assume a semi- honest model in which the output of the protocol would be N for A and ( /(x), \Q\ ) for B, such that all information that each party can infer from the execution of the protocol about the other party's input is no more than what they could infer from their respective outputs. For A, the environment is exactly the same as in the general two-way computation framework, where he cannot decide about which input was given by B (inside the strings of length N with alphabet σ ). Nevertheless, for the case of B, one needs a different definition of what "information about A's input" means. In this case, one may informally consider that the protocol is secure for A if B cannot extract from his output more

information about the tuple (δ, q 0 , F) than he would be able to infer from the output of the automaton when this is run as a black box. More formally, one can state the following definition:

Definition 1 (security in the semi-honest model).

One can say that a protocol Il privately evaluates A 's FSM ( Q, σ, δ,q 0 , F ) on input B's string X B if, given the views for both parties ^ = ((QX,δ,q o ,F),m y ,...,m t , N) (1)

< = ((Q,σ,x B ),m ι ,...,m t ,f(x B )) (2) where m l is the z-th message interchanged between both parties, N is the length of X B and/(jc#) is the output of the automaton, there exist two polynomial time algorithms S A (Q, σ, δ, q 0 , F, N) and S B (Q, σ, x B ) such that their outputs are indistinguishable from the respective views of the parties, ^ λ ϊ ,qo,F,XB ≡ S A (Q^λ,q o ,F,N) (3) ^ } A ^ XB ≡ S B (Q,σ, X B ) (4)

B. Extension to malicious parties

In this disclosure, as well as in the security framework depicted in the previous appendix, only a semi- honest model has been considered. In this appendix, a brief comment is made on the malicious model for said security framework, and the modifications that must be applied to the protocol presented in Section 4 in order to fulfill the security requirements in the malicious case.

In this malicious model, the parties are allowed to deviate from the established protocol. As well as in the semi-honest model, the resulting information that a malicious party can get from the execution of the protocol must not be bigger than that obtained from its own input and the protocol's output, and the same simulator paradigm used in Definition 1 can be used for restarting this definition allowing one malicious party (if both parties are malicious, a security definition is useless).

The protocol presented in Section 4 can be complemented with several additional subb locks in order to grant security also in the malicious model. These subb locks are enumerated in the following:

Verifiable secret Sharing (VSS) Schemes: State information is held by both parties in the form of an additive share, with randomness generated by one party. Applying the same technique as in VSS, the correct generation of the shares can be proven. Non-Interactive Zero-Knowledge Proofs: For demonstrating that the homomorphic operations have been correctly performed, existing Zero-Knowledge proofs (of correct multiplication, correct addition and knowledge of the private operand) can be appended to the sent values, thus making negligible the probability of success of cheating party.

The addition of these proofs results in an increase in communication and computation complexity, but some of them can be precomputed before starting the protocol, and the complexity of those that cannot be precomputed does not depend on the size of the involved matrices and vectors, thus adding only a constant factor to the complexity of the semi-honest protocol, and preserving the order of complexity.

8. REFERENCES [1] Human gemome project, http://genomics.energy.gov.

[2] M. J. Atallah, F. Kerschbaum, and W. Du. Secure and private sequence comparisons. In

Proceedings of the 2003 ACM Workshop on privacy in the electronic society, pages 39-44,

Washington, DC, 2003. ACM Press.

[3] M. J. Atallah and J. Li. Secure outsourcing of sequence comparisons. International Journal of Information Security, 4(4):23-36, October 2005.

[4] R. E. Bellman. Dynamic Programming. Courier Dover Publications, 2003.

[5] B. Chor, S. Goldwasser, S. Micali, and B. Awerbuch. Verifiable secret sharing and achieving simultaneity in the presence of faults. In 25th Annual Symposium on Foundations of Computer Science FOCS'85, pages 383-395. IEEE Computer Society, 1985. [6] I. Damgard, M. Fitzi, E. Kiltz, J. B. Nielsen, and T. Toft. Unconditionally secure constant-rounds multi-party computation for equality, comparison, bits and exponentiation.

In Proceedings of the third Theory of Cryptography Conference, TCC 2006, volume 3876 of

Lecture Notes in Computer Science, pages 285-304. Springer- Verlag, 2006.

[7] O. Goldreich. Secure multi-party computation. Working Draft, 2002. [8] O. Goldreich, S. Micali, and A. Widgerson. How to play any mental game. In

Proceedings of the nineteenth annual ACM conference on Theory of Computing, pages 218—

229, New York, U.S.A., 1987. ACM Press.

[9] S. Goldwasser and S. Micali. Probabilistic encryption. Journal of Computer and System

Sciences, 28(2):270-299, April 1984.

[10] A. Hall. Corning soon: Your personal dna map? http://news.nationalgeographic.com/news/2006/03/0307 060307 dna.html .

[11] J. E. Hopcroft and J. D. Ullman. Introduction to automata theory, languages, and computation. Addison Wesley, 1979. [12] M. Jacobsson and A. Juels. Mix and match: Secure function evaluation via ciphertexts.

In T. Okamoto, editor, Advances in Cryptology - ASIACRYPT'OO, volume 1976 of Lecture

Notes in Computer Science, pages 162-177. Springer- Verlag, 2000.

[13] L. Kruger, S. Jha, E. -J. Goh, and D. Boneh. Secure function evaluation with ordered binary decision diagrams. In Proceedings of the 13th ACM conference on Computer and communications security CCS'06, pages 410-420, Virginia, U.S.A., November 2006. ACM

Press.

[14] V. I. Levenshtein. Binary codes capable of correcting deletions, insertions and reversals.

Doklady Akademii Nauk SSSR, 163(4):845-848, 1965. English translation at Soviet Physics

Doklady 10(8): 707-710, 1966. [15] M. Mohri. On some application of finite-state automata theory to natural language.

Natural Language Engineering, 2(1): 1-20, 1996.

[16] M. Mohri. Finite-state transducers in language and speech processing. Computational

Linguistics, 23(2):269-311, 1997.

[17] M. Naor and K. Nissim. Communication complexity and secure function evaluation. Electronic Colloquium on Computational Complexity (ECCC), 8(062), 2001.

[18] M. Naor and K. Nissim. Communication preserving protocols for secure function evaluation. In ACM Symposium on Theory of Computing, pages 590-599, 2001.

[19] M. Naor and B. Pinkas. Efficient oblivious transfer protocols. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 448-457, Washington, D.C., U.S.A., 2001.

[20] S. B. Needleman and C D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal on Molecular Biology,

48:443-453, 1970.

[21] P. Paillier. Public-key cryptosystems based on composite degree residuosity classes. In Advances in Cryptology - EUROCRYPT 1999, volume 1592 of Lecture Notes in Computer

Science, pages 223-238. Springer, 1999.

[22] K. U. Shulz and S. Mihov. Fast string correction with levenshtein automata.

International Journal of Document Analysis and Recognition (IJDAR), 5(l):67-85, 2002.

[23] T. K. Vintsyuk. Speech discrimination by dynamic programming. Kibernetika, 4:52-57, 1968.

[24] A. C. Yao. Protocols for secure computations. In Proceedings of the IEEE Symposium on Foundations of Computer Science, pages 160-164, 1982. [25] S. Yu and Q. Zhuang. The state complexities of some basic operations on regular languages. Theoretical Computer Science, 125:315-328, 1994.

Certain specific details of the disclosed embodiment are set forth for purposes of explanation rather than limitation, so as to provide a clear and thorough understanding of the present invention. However, it should be understood by those skilled in this art, that the present invention might be practiced in other embodiments that do not conform exactly to the details set forth herein, without departing significantly from the spirit and scope of this disclosure. Further, in this context, and for the purposes of brevity and clarity, detailed descriptions of well-known apparatuses, circuits and methodologies have been omitted so as to avoid unnecessary detail and possible confusion.

Reference signs are included in the claims, however the inclusion of the reference signs is only for clarity reasons and should not be construed as limiting the scope of the claims.