USING DNA ALGORITHMS TO SOLVE NP-COMPLETE PROBLEMS

A summary and annotated bibliography written by Charlotte Miller for a presentation given on November 5, 1996 for the course CS 881a - Topics in Computing and Biology: DNA Computing given by Prof. Lila Kari.

SUMMARY

Introduction

A decision problem is in the class NP if:

there is no known algorithm for it that will execute in polynomial time on a conventional computer
it can be solved in polynomial time using a non-deterministic computer (one with inherently unlimited parallelism).

A problem is intractable if no polynomial time algorithm can possibly be devised for it.

A problem in NP is NP-complete if every other problem in NP can be expressed in terms of it by means of a polynomial time algorithm. NP-complete problems are considered to be the hardest problems, since if any problem in NP is shown to be intractable then all NP-complete problems are intractable. However, if any NP-complete problem can be solved in polynomial time, all problems in NP become tractable.

A non-deterministic computer can be simulated by a test tube of DNA strands using the techniques of molecular biology to implement operations on them. In theory, many otherwise intractable problems can be implemented in polynomial time on such a DNA computer. In fact, Adleman's simulation of the Hamiltonian Path problem, known to be NP-complete, has an execution time that is linear in the number of vertices of the graph. This is the reason that DNA computing has become a new focus of research in computer science.

Following Adleman's success, polynomial time algorithms for other NP-complete problems, such as SAT, Independent Set and 3-colorability have been published. So far, however, a number of problems have made the implementation of DNA algorithms impractical for all but small instances of the problems. The various problems are discussed briefly in the following sections.

Biological Operations

A number of proposals have been made concerning what operations are sufficient to describe and carry out a DNA computation. There seems to be a rough consensus that the operations Extract, Amplify, Merge and Detect form a basic set for decision problems. Melting and/or Annealing are used in the laboratory implementations of Extract, Amplify and Detect. Separation by length is used in the detection step of some optimization problems. Append has been suggested for the construction of the input set of DNA strands as well as a Cut operation on strands at a particular site. Split has been proposed to divide a sample into several smaller samples of identical composition. Intersection and Match are independently proposed but essentially equivalent operations with quite different implementations.

Errors

Most of the algorithms given assume that the biological operations they employ are error-free. In practice, however this is not the case: PCR amplification can incorporate base mismatches and extraction techniques are known not to be 100% efficient either at removing all "correct" strands or at not removing "incorrect" ones. The gradual deterioration of DNA over time is another error source that has not been addressed at all.

An open problem is how to select a coding scheme for the oligonucleotides that encode a particular problem instance. If one code contains a substring that even approximately matches a substring of another code or sequence of codes, incorrect encodings of, for example, paths through a graph may be present in the input DNA strands.

Solution Space

The DNA algorithms presented by Adleman for Hamiltonian Path and 3-Coloring, by Lipton for SAT, and by Jonoska and Karl for Road Coloring are formulated as exhaustive searches of a solution space that is exponential in size. All of these algorithms can be executed in polynomial time due to the massive parallelism inherent in DNA computation. Unfortunately, they are limited in applicability to small instances of these problems because they require the generation of an unrestricted solution space. For example, the DNA encoding of all paths through a connected graph of 200 nodes would weigh more than the earth. For the size of graph for which Adleman's algorithm is feasible, it is possible to find a solution with an electronic computer using a more efficient algorithm.

Even encoding information at the submolecular level cannot offset the exponential growth rate of solution space with instance size. This raises the question of how useful DNA computing can be.

Another limiting problem is that of strand length. Large problem instances may require very long strands of DNA to encode them. There is an upper limit to the length of strand that can be efficiently separated by gel electrophoresis.

THE PAPERS

E. Bach, A. Condon, E. Glaser and C. Tanguay, DNA Models and Algorithms for NP-complete Problems

This paper presents a method of reducing the solution space size by tailoring the set of encoded strings used as input to the particular instance to be solved. To do this, a problem-specific generator algorithm is constructed. The generator is modeled as a directed acyclic graph with edges corresponding to test tubes and vertices to one of the DNA operations Merge, Extract, Append, Split, and No-op. The graph is arranged as a set of levels, with all the edges starting from vertices in one level ending at vertices in the next. Beginning with a test tube containing a finite set of empty strings, the generator builds the solution set by performing the operations at each level in turn, starting at the root and ending at the sink vertex.

New algorithms for 3-Coloring and Independent Set are presented with improved but still exponential bounds on the solution space sizes. They are described in terms of the solution space generated and the computations required to search it.

This technique allows the practical solution of 3-Coloring and Independent Set problem instances with almost twice the number of vertices as algorithms that require an exhaustive search. The solution spaces generated still grow exponentially with graph size, however. The limit on the size of an NP problem that can be handled by DNA computation is pushed back by the generator approach, but not eliminated.

R. J. Lipton, DNA Solution of Hard Computational Problems

In the first part of this paper, an algorithm is presented for Satisfiability (SAT) an NP-complete problem from Boolean logic. SAT asks if there is an assignment of truth values (0 or 1) to the variables of a Boolean formula in conjunctive normal form such that the formula is true (evaluates to 1). For a formula with n variables there are 2n distinct sets of truth assignments (n-bit sequences) to be tested. A general description of a graph is given such that the paths through the graph represent all n-bit binary numbers. These paths are encoded as double stranded DNA using the same encoding method used by Adleman for the Hamiltonian Path problem. An algorithm for SAT is given in which the number of steps is linear in the number of clauses in the formula. The reliability of the algorithm does not depend on the Extract operation isolating every sequence with the desired property.

In the second part, an algorithm is given for SAT for contact networks, which includes generalized SAT, that is, SAT for any Boolean formula using only the operations of negation, AND, and OR. The number of steps is linear in the number of operations in the formula.

The importance of this result is that it provides a general method for solving NP problems that can be expressed as Boolean formulas.

D. Boneh, C. Dunworth, R. J. Lipton and J. Sgall, On the Computational Power of DNA

This paper combines a survey of DNA computing algorithms with new results, including generalizing the results of Lipton to SAT for general Boolean circuits with arbitrary binary gates and methods for solving NP-hard optimization problems.

Another important contribution of this paper is the presentation of an encoding of an arbitrary finite state automaton. This is useful when the solution to a problem is known to belong to a particular subset of the set of all combinatorial possibilities, such as a regular language. Fewer biological operations are required by the algorithm to eliminate the incorrect answers when the solution space can be restricted in this way.

The encoding of the sequences used for input to an algorithm is somewhat more complex than that used by Adleman. The emphasis here is to reduce the execution time of the algorithm, although this is clearly one way to reduce the space requirements as well. This is curious because the authors take great care to compare algorithms on the basis of both time and space complexity in their survey.

The general conclusion of the survey is that most DNA algorithms have execution times linear in the size of the simulated object (e.g., number of gates in a circuit) and space requirements exponential in the number of input variables (e.g., number of Boolean variables in a formula).

The authors present a modification of the encoding of input sequences to extend the use of DNA algorithms in an efficient manner to optimization problems. In SAT, both 0's and 1's are encoded as sequences of equal length. By using a longer sequence of bases to encode 1's than 0's, the satisfying assignment containing the greatest number of 1's can be found using gel electrophoresis on the final test tube containing all valid assignments. Without the encoding information, a binary search would have to be performed.

N. Jonoska and S. A. Karl, A Molecular Computation of the Road Coloring Problem

In this paper, two DNA algorithms are presented for the Road Coloring problem, a problem that is in NP but not known to be NP-complete. In essence, the problem asks whether or not a given strongly connected aperiodic directed graph with out-degree k is synchronizable. That is, is there a deterministic k-coloring of the edges of the graph such that starting from any vertex in the graph, there is a directed path having a specific sequence of colors that will always lead to a particular destination vertex. (A deterministic coloring is one in which, for each vertex, no two outgoing edges have the same color.) This unique sequence of colors is called a synchronizing word for the destination vertex. A graph is synchronizable if there is a unique synchronizing word for each vertex.

The first algorithm solves the simpler problem of: given a graph that is already colored, is the coloring synchronizing? The algorithm tests whether or not there is a synchronizing word for one arbitrary vertex. Its time complexity is linear in the number of vertices (O(n)).

The second algorithm solves the general problem given the above for a graph with out-degree 2. A path containing more than one occurrence of an edge must have that edge colored consistently throughout. Therefore the algorithm is complicated by the need to reject paths with inconsistent colorings as well as the need to make sure that matching color sequences refer to the same graph coloring. This algorithm is O(n²).

A detailed discussion of the laboratory techniques required to implement the steps of the algorithms is given. Included is a procedure for implementing a new operation, called Match, that is used in both algorithms to compare color sequences. Match has not yet been implemented in the laboratory. It involves both an extraction operation and PCR amplification. There is concern that using it repetitively as in the second algorithm will lead to an amplification of errors.

Factors that limit the applicability of DNA computation to the Road Coloring problem include strand length, weight of the initial solution of DNA components, and the need for a useful choice of encoding. There is an upper limit on the length of DNA strands that can be separated by gel electrophoresis and careful encoding is necessary to prevent incorrect matching of oligos when the paths are formed.

D. Boneh and R. J. Lipton, Making DNA Computers Error Resistant

This paper examines causes of error in DNA computations. A method is proposed of altering the algorithms for problems in a particular class within NP for which the number of strands (volume) of DNA either remains the same throughout the computation or decreases uniformly. The following sources of error are cited:

extraction is only about 95% efficient and sometimes removes the wrong strands, and
DNA strands deteriorate over time.

The authors claim that extraction is the most important operation. Whether this is because it is used so frequently or because it is the greatest source of error is not made clear. A modification for decreasing volume algorithms, such as the Hamiltonian Path and SAT algorithms, that will prevent the accumulation of errors due to repeated extractions is presented. First, the number of strands of DNA is doubled using PCR every s/n steps, where s is the number of steps in the algorithm and n is the number of input variables. Second, assuming that the final test tube contains some incorrect strands, in the detection step one must keep sequencing strands and testing them for correctness until a valid solution is found or some limit on the number of sequencing operations is reached. Clearly this last modification is suitable for decision problems but it is hard to see how it could be useful for optimization problems. This method can be extended to a constant volume algorithm, such as Circuit Satisfiability, by finding a new decreasing volume algorithm for the same problem and applying the modifications to it.

A new DNA operation called Intersection is proposed to simulate an AND gate. It is essentially the same as the Match operation for the Road Coloring problem, but is less efficient since it requires multiple extractions.

No modifications are suggested to prevent the deterioration of DNA over time. Unfortunately, since the suggested periodic reamplifications and repeated sequencing add to the length of an algorithm, reducing extraction errors may make deterioration more likely.

REFERENCES

L. M. Adleman, “On constructing a molecular computer.”

L. M. Adleman, “Molecular computation of solutions to combinatorial problems.” Science V. 266 (November 1994), 1021-1023. Q1.S35

E. Bach, A. Condon, E. Glaser and C. Tanguay, “DNA models and algorithms for NP-complete problems.” March 1996.

D. Boneh, C. Dunworth, R. J. Lipton and J. Sgall, “On the computational power of DNA.”

L. D. Boneh and R. J. Lipton, “Making DNA computers error resistant.” Proceedings of the 2nd Annual Meeting of DNA based computers, 1996.

M. Garey and D. Johnson, Computers and Intractability. A guide to the theory of NP-completeness. W. H. Freeman and Company, San Francisco, 1979. QA76.6.G35

N. Jonoska and S. A. Karl, “A molecular computution of the Road Coloring problem.” Proceedings of the 2nd Annual Meeting of DNA based computers, 1996. 148-159. Princeton, 1996.

L. Kari, “DNA computers: tomorrow's reality.” Bulletin of EATCS (summer 1996).

R. J. Lipton, “DNA solution of hard computational problems.” Science V. 268 (April 1995), 542-545. Q1.S35