Sequence Analysis Primer

Published By Oxford University Press

9780195098747, 9780197560907

Author(s):  
David J. States ◽  
Mark S. Boguski

Properly approached, molecular sequence data is a rich source of knowledge capable of teaching us much about the structure, function, and evolution of biological macromolecules. To effectively realize this potential, however, some understanding of the process of and theoretical basis for sequence comparison is needed, as well as a variety of practical tools to access and manipulate the data. The volume of molecular sequence data has long since surpassed human information processing capacity for even simple tasks such as searching for related sequences, and with the ever increasing rate at which new sequences are being produced, the need for computer-assisted analysis becomes more and more acute. Automated tools can extend human capabilities by orders of magnitude in both speed and accuracy. The educated application of these automated tools is an essential part of modern molecular biology research. This chapter considers the theory and practice of analyzing sequence similarity as it applies to database searching and sequence alignment. Five major areas will be examined. First, we describe the use of dot matrix plots to elucidate the structures and features relating a sequence pair. Secondly, we discuss optimal pairwise alignment of sequences using dynamic programming algorithms. Thirdly, we examine fast, approximate techniques for detecting local similarities. Fourthly, the uses of and techniques for multiple sequence alignment are described. Finally, the statistical significance of sequence similarity is considered. In the analysis of molecular sequences, the terms similarity and homology are often used without a clear understanding of their distinct implications. Similarity is a descriptive term which only implies that two sequences, by some criterion, resemble each other and carries no suggestion as to their origins or ancestry. Homology refers specifically to similarity due to descent from a common ancestor (Patterson, 1988; Reeck et al., 1987).
On the basis of similarity relationships among a group of sequences, it may be possible to infer homology, but outside of an explicit laboratory model system, descent from a common ancestor remains hypothetical. There are philosophical issues in the inference of homology as well as practical ones. In classical morphology, conjunction (the occurrence of two traits in a single individual) is considered evidence that they are not homologous (Patterson, 1982).
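
As an illustration of the dot matrix plots mentioned above, the following is a minimal sketch rather than the chapter's own method. The function names `dot_matrix` and `render`, and the `window` and `min_matches` parameters, are assumptions introduced here for illustration. A cell is marked when enough positions match within a short window starting at that pair of positions, so similar regions show up as diagonal runs.

```python
def dot_matrix(seq_a, seq_b, window=1, min_matches=1):
    """Return a grid of booleans; cell [i][j] is True when at least
    `min_matches` of the `window` positions starting at seq_a[i] and
    seq_b[j] are identical. (Parameter names are illustrative.)"""
    rows = len(seq_a) - window + 1
    cols = len(seq_b) - window + 1
    matrix = []
    for i in range(rows):
        row = []
        for j in range(cols):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            row.append(matches >= min_matches)
        matrix.append(row)
    return matrix


def render(matrix):
    """Draw the grid as text: '*' for a marked cell, '.' otherwise."""
    return "\n".join(
        "".join("*" if cell else "." for cell in row) for row in matrix
    )


if __name__ == "__main__":
    # Comparing a sequence with itself gives an unbroken main diagonal.
    print(render(dot_matrix("GATTACA", "GATTACA")))
```

Raising `window` and `min_matches` together suppresses isolated chance matches, at the cost of missing very short similarities.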


Author(s):  
Roland Lüthy ◽  
David Eisenberg

Given a protein sequence, the amino acid composition can be determined by counting the number of residues of each type. Then a molecular weight can be calculated by summing the molecular weights of the individual amino acid residues, taking into account the loss of one H2O molecule per peptide bond. Table 1 lists the molecular weights of the twenty amino acids and water. This approach assumes that the protein has not been covalently modified. Because of extensive glycosylation of some proteins, this approach can significantly underestimate the actual molecular weight. With the pKa values of Table 1, it is possible to calculate the theoretical charge of a protein at a given pH by summing the charges of the amino acid side chains and of the amino terminus and carboxyl terminus. By performing this calculation over a pH range, one obtains a theoretical titration curve and an isoelectric point (the pH at which the protein has a net charge of zero). This method assumes that all normally titratable groups are accessible to water, and that all side chains have the intrinsic pKa values listed in Table 1. This assumption is not completely correct, and consequently, the theoretical isoelectric point may differ from the experimentally determined value. Figure 1 shows the calculated titration curve for pancreatic ribonuclease: the calculated isoelectric point is 8.2, whereas the measured value is 9.6 (Lehninger, 1977). The calculation of extinction coefficients (Gill and von Hippel, 1989) is performed in much the same way as that of the isoelectric point. Individual residues are treated as if they are free amino acids, and the overall extinction coefficient is calculated as the sum of the extinction coefficients of the residues. The same basic assumption is made: Residues are assumed to be in typical environments and not to show unusual absorption due to their local environments.
In the case of the extinction coefficient, however, this assumption seems to be generally acceptable; calculated extinction coefficients are typically within a few percent of the experimentally determined value, and errors of more than 15% are rare (Gill and von Hippel, 1989).
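
The charge calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the pKa values below are rough illustrative figures, not the values of Table 1 (which is not reproduced here), and the names `net_charge` and `isoelectric_point` are hypothetical. Each titratable group contributes a fractional charge from the Henderson-Hasselbalch relation, and the isoelectric point is found by bisection because net charge falls steadily as pH rises.

```python
from collections import Counter

# ASSUMED, illustrative pKa values; substitute the values from Table 1.
POSITIVE_PKA = {"K": 10.5, "R": 12.5, "H": 6.0, "N_TERM": 9.0}
NEGATIVE_PKA = {"D": 3.9, "E": 4.1, "C": 8.3, "Y": 10.1, "C_TERM": 2.0}


def net_charge(sequence, ph):
    """Theoretical net charge of a one-letter protein sequence at a given pH."""
    counts = Counter(sequence.upper())
    counts["N_TERM"] = 1  # one free amino terminus
    counts["C_TERM"] = 1  # one free carboxyl terminus
    charge = 0.0
    for group, pka in POSITIVE_PKA.items():
        # Fraction of a basic group that is protonated (charge +1).
        charge += counts[group] / (1.0 + 10 ** (ph - pka))
    for group, pka in NEGATIVE_PKA.items():
        # Fraction of an acidic group that is deprotonated (charge -1).
        charge -= counts[group] / (1.0 + 10 ** (pka - ph))
    return charge


def isoelectric_point(sequence, low=0.0, high=14.0, tolerance=1e-4):
    """Bisect for the pH at which the theoretical net charge is zero."""
    while high - low > tolerance:
        mid = (low + high) / 2.0
        if net_charge(sequence, mid) > 0:
            low = mid   # still positive: the zero lies at higher pH
        else:
            high = mid  # zero or negative: the zero lies at lower pH
    return (low + high) / 2.0


if __name__ == "__main__":
    peptide = "ACDKRHEY"  # arbitrary example peptide
    print(round(isoelectric_point(peptide), 2))
```

Evaluating `net_charge` over a range of pH values gives the theoretical titration curve. As the text notes, the result inherits the assumption that every group is water-accessible and has its intrinsic pKa.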


Author(s):  
Peter M. Rice ◽  
Keith Elliston

Software packages are available for all common laboratory computer systems. The packages for personal computers (PC or Macintosh) are able to assemble and correct the sequence; those for the larger systems (VAX or Unix) are generally able to analyze the sequence in greater detail. Most laboratories will be able to use sequence assembly programs in their favorite sequence analysis software package. In general, the stages of sequence assembly are gel entry, overlap detection, editing, and reporting. The available programs differ in the ways they handle each of these tasks. No single package is ideal, though all should be adequate for a smaller project such as a single cDNA. Particular attention should be given to the quality and features of the editor, as this is where most time will be spent, and to the possibilities of extending the software to cope with problems that may arise. Good status reports and a choice of methods for overlap detection can save considerable time in resolving ambiguities and correcting errors later. Figure 1 lists some of the commonly used sequence assembly programs. The prices vary widely depending on the features of the package and the options for academic or commercial licenses. Originally, each package used its own “special” codes to represent ambiguous bases and gaps in sequences. Most packages now use the standard IUB-IUPAC codes (Figure 2) for the nucleotides, though the program documentation should be checked before starting the project. The task of sequence reading depends on the sequencing protocol used. In many laboratories the sequence is generated on an autoradiograph (Figure 3) from which the sequence is read. Although automated gel readers are on the market, most sequence data is read manually with the aid of a digitizer. Most sequence assembly programs accept DNA sequence read by a sonic digitizer.
An example of a device which is supported by most of the available programs is the GrafBar GP-7 [Science Accessories Corporation, Southport, CT, USA and P.M.S. (Instruments) Ltd., Waldeck House, Reform Road, Maidenhead, Berks, SL6 8BX, UK]. Sonic digitizers have a stylus to point to locations on an autoradiograph, which is illuminated from below by a light box.
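
To make the overlap detection stage and the IUB-IUPAC ambiguity codes mentioned above more concrete, here is a minimal sketch. It does not reproduce the method of any package listed in the chapter. The names `compatible` and `best_overlap` and the `min_length` parameter are assumptions made for illustration. Two codes are treated as matching when the sets of bases they stand for share at least one base, and the longest suffix of one reading that matches a prefix of the next is reported.

```python
# Standard IUB-IUPAC nucleotide codes and the bases each one represents.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(code_a, code_b):
    """True when the two codes could represent the same base."""
    return bool(set(IUPAC[code_a]) & set(IUPAC[code_b]))


def best_overlap(left, right, min_length=3):
    """Length of the longest suffix of `left` compatible with a prefix of
    `right`, or 0 if no overlap of at least `min_length` bases exists."""
    left, right = left.upper(), right.upper()
    longest = min(len(left), len(right))
    for length in range(longest, min_length - 1, -1):
        pairs = zip(left[-length:], right[:length])
        if all(compatible(a, b) for a, b in pairs):
            return length
    return 0


if __name__ == "__main__":
    # The ambiguous R (A or G) at the end of the first reading is
    # compatible with the A in the second reading.
    print(best_overlap("ACGTTGCR", "TGCAGGT"))
```

This sketch is deliberately simplified: it tolerates no mismatches or gaps and ignores the reverse-complement strand, both of which a practical assembly program has to handle.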


Author(s):  
Lisa Caballero

The Notch sequence from Drosophila is used as the sample data. Notch is thought to control cell fate decisions in development. It encodes a large, transmembrane protein which may function through cell adhesion, and it was cloned and sequenced in 1985 (Wharton et al., 1985a&b; Kidd et al., 1986). Notch is an ideal sequence to analyze because it contains many features that computers are good at finding. Figure 1 shows a schematic of the Notch protein and its major features. The Notch sequence is available in the Genbank and EMBL sequence databases under accession numbers M16153, M16149, M16150, M16151 and M16152 (see Appendix VIS). The most successful way to approach this chapter is to reproduce the analyses. This will familiarize one with a specific software package, and offers a more accurate picture of the volume of output data produced by many programs than could be allowed in the figures in this chapter. Programs for most of the analyses used in this chapter are widely available on IBM PCs, Macintoshes and mainframe computers. The examples have intentionally been kept generic, but programs from the following sources were used: Genetics Computer Group Sequence Analysis Software (Devereux et al., 1984; Genetics Computer Group Inc., Madison, WI), Genbank Online Services (Benton, 1990), National Library of Medicine Services (Benson et al., 1990), and PC Gene (A. Bairoch, University of Geneva; ™ Intelligenetics Inc., Mountain View, CA and Genofit SA). Unless a researcher is studying nontranslatable segments of DNA, the immediate goal upon the isolation of a new gene is usually to deduce the amino acid sequence of its product. The laboratory approach might proceed by isolating a cDNA clone, determining its nucleotide sequence, locating a large open reading frame, and translating the sequence into a putative protein. In this case priority is usually given to analyzing the putative protein, with promoter regions and introns being sequenced later to elucidate gene regulation.
The organization of the following example analysis of Notch reflects these laboratory priorities by beginning with cDNA analysis, moving to protein analysis, and then returning to DNA analysis for the genomic sequence.
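
The step of locating a large open reading frame and translating it into a putative protein can be sketched as below. This is a simplified illustration, not the procedure of any package named above. The function names `translate` and `longest_orf` are hypothetical, only the three forward reading frames are scanned, and an open reading frame is assumed to run from an ATG codon to the first in-frame stop codon. The codon table is the standard genetic code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA from its first base; '*' marks a stop codon and
    'X' marks a codon containing a non-ACGT character."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def longest_orf(dna):
    """Return the longest ATG-to-stop stretch in the three forward frames
    (stop codon included), or an empty string if there is none."""
    dna = dna.upper()
    best = ""
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG opens a candidate reading frame
            elif start is not None and CODON_TABLE.get(codon) == "*":
                if i + 3 - start > len(best):
                    best = dna[start:i + 3]
                start = None  # closed; look for the next ATG
    return best


if __name__ == "__main__":
    orf = longest_orf("CCATGGCTTGGTAACC")  # short made-up example
    print(orf, translate(orf))
```

For a genomic sequence the reverse-complement strand would also need scanning, and introns would interrupt the reading frame, which is one reason the text begins with the cDNA.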

