PLoS Comput Biol. 2016 Jun 7;12(6):e1004867.
doi: 10.1371/journal.pcbi.1004867. eCollection 2016 Jun.

An Introduction to Programming for Bioscientists: A Python-Based Primer

Berk Ekmekci et al. PLoS Comput Biol.

Abstract

Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in molecular biology, biochemistry, and other biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. Starting with basic concepts, such as that of a "variable," the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences.

Conflict of interest statement

The authors have declared that no competing interests exist.

Figures

Fig 1. Strings in Python: anatomy and basic behavior.
The anatomy and basic behavior of Python strings are shown, as samples of actual code (left panel) and corresponding conceptual diagrams (right panel). The Python interpreter prompts for user input on lines beginning with >>> (leftmost edge), while a starting ... denotes a continuation of the previous line; output lines are not prefixed by an initial character (e.g., the fourth line in this example). Strings are simply character array objects (of type str), and a sample string-specific method (replace) is shown on line 3. As with ordinary lists, strings can be ‘sliced’ using the syntax shown here: the first list element to be included in the slice is indexed by start, and the last included element is at stop-1, with an optional stride of size step (defaults to one). Concatenation, via the + operator, is the joining of whole strings or subsets of strings that are generated via slicing (as in this case). For clarity, the integer indices of the string positions are shown only in the forward (left to right) direction for mySnake1 and in the reverse direction for mySnake2. These two strings are sliced and concatenated to yield the object newSnake; note that slicing mySnake1 as [0:7] and not [0:6] means that a whitespace char is included between the two words in the resultant newSnake, thus obviating the need for further manipulations to insert whitespace (e.g., concatenations of the form word1+' '+word2).
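The slicing and concatenation behavior described in this caption can be sketched in a few lines of Python. The variable names mySnake1, mySnake2, and newSnake come from the caption, but the string values below are hypothetical stand-ins, since the figure's actual contents are not reproduced here; the helper names renamed, genus, species, and everyOther are likewise illustrative.

```python
# Hypothetical example strings; the actual values in the figure may differ.
mySnake1 = "Python regius"
mySnake2 = "Boa constrictor"

# A string-specific method: replace() returns a new string,
# leaving the original object unchanged.
renamed = mySnake1.replace("regius", "molurus")

# Slicing uses [start:stop:step]; the last included element is at stop-1.
genus = mySnake1[0:7]         # 'Python ' (index 6 is the whitespace char)
species = mySnake2[-11:]      # negative indices count from the right
everyOther = mySnake1[0:6:2]  # stride of 2 over 'Python'

# Concatenation with + joins the two slices; no extra ' ' is needed
# because the [0:7] slice already ends with a space.
newSnake = genus + species
print(newSnake)
```

Had the first slice been written as mySnake1[0:6], the two words would run together unless a separator were added explicitly.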
Fig 2. Python’s scope hierarchy and variable name resolution.
As described in the text, multiple names (variables) can reference a single object. Conversely, can a single variable, say x, reference multiple objects in a unique and well-defined manner? Exactly this is enabled by the concept of a namespace, which can be viewed as the set of all name-to-object mappings for all variable names and objects at a particular “level” in a program. This is a crucial concept, as everything in Python is an object. The key idea is that name-to-object mappings are insulated from one another, and therefore free to vary, at different “levels” in a program; e.g., x might refer to object obj2 in a block of code buried (many indentation levels deep) within a program, whereas the same variable name x may reference an entirely different object, obj1, when it appears as a top-level (module-level) name definition. This seeming ambiguity is resolved by the notion of variable scope. The term scope refers to the level in the namespace hierarchy that is searched for name-to-object mappings; different mappings can exist in different scopes, thus avoiding potential name collisions. At a specific point in a block of code, in what order does Python search the namespace levels? (And, which of the potentially multiple name-to-object mappings takes precedence?) Python resolves variable names by traversing scope in the order LEGB, as shown here. L stands for the local, innermost scope, which contains local names and is searched first; E follows, and is the scope of any enclosing functions; next is G, which is the namespace of all global names in the currently loaded modules; finally, the outermost scope B, which consists of Python’s built-in names (e.g., int), is searched last. The two code examples in this figure demonstrate variable name resolution at local and global scope levels.
In the code on the right-hand side, the variable e is used both (i) as a name imported from the math module (global scope) and (ii) as a name that is local to a function body, albeit with the global keyword prior to being assigned to the integer -1234. This construct leads to a confusing flow of logic (colored arrows), and is considered poor programming practice.
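The LEGB lookup order can be illustrated with a minimal sketch. This is not the code from the figure: the function names outer, inner, inner_no_local, read_global, parse_number, and rebind_e are hypothetical, as are the string values bound to x. Only the use of e from the math module and its reassignment to -1234 via the global keyword follow the caption.

```python
from math import e  # binds the global (module-level) name e to Euler's number

x = "global x"  # G: module-level name

def outer():
    x = "enclosing x"  # E: visible to functions nested inside outer()

    def inner():
        x = "local x"  # L: found first, shadows the outer bindings
        return x

    def inner_no_local():
        return x  # no local x, so the enclosing scope supplies it

    return inner(), inner_no_local()

def read_global():
    return x  # no local or enclosing x, so the global binding is used

def parse_number(text):
    return int(text)  # B: int is not defined here, so the built-in is found

def rebind_e():
    # The global keyword makes the assignment rebind the module-level e
    # rather than create a local name. This pattern is confusing and is
    # shown only to mirror the poor practice that the caption warns about.
    global e
    e = -1234

e_before = e
rebind_e()
e_after = e  # the imported constant has been silently replaced

print(outer(), read_global(), parse_number("42"), e_before, e_after)
```

Without the global statement, the assignment inside rebind_e would create a new local name and leave the module-level e untouched.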
Fig 3. Sample flowchart for a sorting algorithm.
This flowchart illustrates the conditional constructs, loops, and other elements of control flow that comprise an algorithm for sorting, from smallest to largest, an arbitrary list of numbers (the algorithm is known as “bubble sort”). In this type of diagram, arrows symbolize the flow of logic (control flow), rounded rectangles mark the start and end points, slanted parallelograms indicate I/O (e.g., a user-provided list), rectangles indicate specific subroutines or procedures (blocks of statements), and diamonds denote conditional constructs (branch points). Note that this sorting algorithm involves a pair of nested loops over the list size (blue and orange), meaning that the calculation cost will go as the square of the input size (here, an N-element list); this cost can be halved by adjusting the inner loop conditional to be “j < N − i − 1”, as the largest i elements will have already reached their final positions.
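A minimal Python sketch of the bubble sort described in this caption is given below. The function name bubble_sort, the shrink_inner flag, and the comparison counter are illustrative additions rather than part of the flowchart, and the unadjusted inner loop is assumed to run over all N − 1 adjacent pairs on every pass.

```python
def bubble_sort(values, shrink_inner=True):
    """Sort values from smallest to largest; return (sorted_list, comparisons).

    shrink_inner is a hypothetical flag: when True, the inner loop uses the
    condition j < N - i - 1 from the caption; when False, it is assumed to
    visit all N - 1 adjacent pairs on every pass.
    """
    items = list(values)  # work on a copy so the caller's list is unchanged
    n = len(items)
    comparisons = 0
    for i in range(n):  # outer loop over the list size
        # After i passes, the largest i elements are already in place.
        inner_limit = n - i - 1 if shrink_inner else n - 1
        for j in range(inner_limit):  # inner loop over adjacent pairs
            comparisons += 1
            if items[j] > items[j + 1]:  # conditional branch: out of order?
                items[j], items[j + 1] = items[j + 1], items[j]  # swap
    return items, comparisons

print(bubble_sort([5, 2, 9, 1, 3]))
print(bubble_sort([5, 2, 9, 1, 3], shrink_inner=False))
```

For an N-element list, the adjusted loop performs N(N − 1)/2 comparisons versus N(N − 1) for the unadjusted one, so the cost is halved but still grows as the square of the input size.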
