OPTIMAL ALIGNMENT AND LCS


In computational genetics and computational linguistics, one of the basic problems is to find an optimal alignment between two given sequences (texts), X and Y. This requires a scoring system which can rank the alignments. Typically a substitution matrix gives the score for each possible pair of letters. The total score of an alignment is the sum of terms for each aligned pair of residues, plus terms for each gap. Sometimes we consider the special case where the substitution matrix is equal to the identity matrix and the gap-penalty is zero. In this case, the optimal score is equal to the length of the Longest Common Subsequence (LCS) of the two texts X and Y. (A common subsequence of X and Y is a sequence which is a subsequence of X as well as of Y). We frequently refer to the length of the LCS of X and Y as the optimal LCS-score.

Fluctuations

A long open standing problem in the LCS context is to determine the exact order of the fluctuation of the LCS length. For this we take the text X and Y to be i.i.d. and independent of each other. We solved the long open problem of the order of the fluctuation of the LCS and the OA in several important cases. Here are some of our results:

Expected value

One of the main ideas which allowed all the subsequent developements is contained in: Large deviation approach to LCS. The idea is to identify each alignment with an increasing sequence of stopping times. Then, one tries to figure out a low-entropy class of stopping times which is likely to contain the OA. The rest is large deviation in order to control what happens along the alignments from the class. With this approach we beat the existing bounds for the LCS-constant. The same technique is applied to the LCS curve in: The expected curve.

Structure of the optimal alignment path

We developed a set of techniques which yield results on the microscopical and macroscopical structure of the optimal path, as well as the transversal and longitudinal fluctuations. Furthermore, we discovered deep relations between these entities. Some of these results are contained in: