Skip to main page content
U.S. flag

An official website of the United States government

Dot gov

The .gov means it’s official.
Federal government websites often end in .gov or .mil. Before sharing sensitive information, make sure you’re on a federal government site.

Https

The site is secure.
The https:// ensures that you are connecting to the official website and that any information you provide is encrypted and transmitted securely.

Access keys NCBI Homepage MyNCBI Homepage Main Content Main Navigation
. 2011 Feb 15;12 Suppl 1(Suppl 1):S55.
doi: 10.1186/1471-2105-12-S1-S55.

Closest string with outliers

Affiliations

Closest string with outliers

Christina Boucher et al. BMC Bioinformatics. .

Abstract

Background: Given n strings s1, …, sn each of length ℓ and a nonnegative integer d, the CLOSEST STRING problem asks to find a center string s such that none of the input strings has Hamming distance greater than d from s. Finding a common pattern in many--but not necessarily all--input strings is an important task that plays a role in many applications in bioinformatics.

Results: Although the closest string model is robust to the oversampling of strings in the input, it is severely affected by the existence of outliers. We propose a refined model, the closest string with outliers (CSWO) problem, to overcome this limitation. This new model asks for a center string s that is within Hamming distance d to at least n - k of the n input strings, where k is a parameter describing the maximum number of outliers. A CSWO solution not only provides the center string as a representative for the set of strings but also reveals the outliers of the set.We provide fixed parameter algorithms for CSWO when d and k are parameters, for both bounded and unbounded alphabets. We also show that when the alphabet is unbounded the problem is W[1]-hard with respect to n - k, ℓ, and d.

Conclusions: Our refined model abstractly models finding common patterns in several but not all input strings. We initialize the study of the computability of this model and show that it is sensitive to different parameterizations. Lastly, we conclude by suggesting several open problems which warrant further investigation.

PubMed Disclaimer

Figures

Figure 1
Figure 1
An illustration of the reduction from CLIQUE to CSWO Example of the reduction from a CLIQUE instance G with t = 3 to an instance of CSWO with 12 strings with ℓ = t = 3, d = t – 2 = 1, and formula image. In bold we have the set of strings S* = {s121, s132, s234} that corresponds to the clique containing the vertices {v1,v2,v3}. We note that S* is the only set such that |S*| = 3, which have Hamming distance at most d from s*.

References

    1. Dopazo J, Rodríguez A, Sáiz J, Sobrino F. Design of primers for PCR amplification of highly variable genomes. Computer Applications in the Biosciences. 1993;9:123–125. - PubMed
    1. Lanctot J, Li M, Ma B, Wang S, Zhang L. Distinguishing string selection problems. Information and Computation. 2003. pp. 41–55. - DOI
    1. Lucas K, Busch M, Össinger S. Thompson J: An improved microcomputer program for finding gene-and gene family-specific oligonucleotides suitable as primers for polymerase chain reactions or as probes. Computer Applications in the Biosciences. 1991;7:525–529. - PubMed
    1. Proutski V, Holme E. Primer master: A new program for the design and analyiss of PCR primers. Computer Applications in the Biosciences. 1996;12:253–255. - PubMed
    1. Deng X, Li G, Li Z, Ma B, Wang L. Genetic design of drugs without side-effects. SIAM Journal on Computing. 2003;32(4):1073–1090. doi: 10.1137/S0097539701397825. - DOI

Publication types

LinkOut - more resources