A Novel Test for Absolute Fit of Evolutionary Models Provides a Means to Correctly Identify the Substitution Model and the Model Tree

Vadim Goremykin

doi:10.1093/gbe/evz167

Genome Biology and Evolution ◽

10.1093/gbe/evz167 ◽

2019 ◽

Vol 11 (8) ◽

pp. 2403-2419

Author(s):

Vadim Goremykin

Keyword(s):

Sequence Data ◽

Evolutionary Model ◽

Error Rates ◽

Tree Topology ◽

Character State ◽

Model Data ◽

Substitution Model ◽

Power Of The Test ◽

Wide Range ◽

Correct Tree

Abstract A novel test is described that visualizes the absolute model-data fit of the substitution and tree components of an evolutionary model. The test utilizes statistics based on counts of character state matches and mismatches in alignments of observed and simulated sequences. This comparison is used to assess model-data fit. In simulations conducted to evaluate the performance of the test, the test estimator was able to identify both the correct tree topology and substitution model under conditions where the Goldman–Cox test—which tests the fit of a substitution model to sequence data and is also based on comparing simulated replicates with observed data—showed high error rates. The novel test was found to identify the correct tree topology within a wide range of DNA substitution model misspecifications, indicating the high discriminatory power of the test. Use of this test provides a practical approach for assessing absolute model-data fit when testing phylogenetic hypotheses.
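The abstract describes statistics built from counts of character state matches and mismatches between aligned observed and simulated sequences. The following is only a minimal sketch of that counting step, not the author's estimator: the function name `match_mismatch_counts` and the toy sequences are hypothetical, and gaps or ambiguity codes are treated as ordinary characters for simplicity.

```python
from typing import Tuple


def match_mismatch_counts(seq_a: str, seq_b: str) -> Tuple[int, int]:
    """Count matching and mismatching character states, column by column.

    Both sequences are assumed to be rows of the same alignment, so they
    must have equal length. Every character (including '-') is compared
    literally; a real analysis would likely handle gaps separately.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    mismatches = len(seq_a) - matches
    return matches, mismatches


if __name__ == "__main__":
    observed = "ACGTACGT"    # hypothetical observed sequence
    simulated = "ACGAACGG"   # hypothetical sequence simulated under a model
    print(match_mismatch_counts(observed, simulated))
```

How such counts are combined into the test statistic is specified in the paper itself and is not reproduced here.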

mtDNAcombine: tools to combine sequences from multiple studies

BMC Bioinformatics ◽

10.1186/s12859-021-04048-0 ◽

2021 ◽

Vol 22 (1) ◽

Author(s):

Eleanor F. Miller ◽

Andrea Manica

Keyword(s):

Sequence Data ◽

Data Extraction ◽

Bayesian Skyline Plot ◽

Model Organisms ◽

Data Sets ◽

Data Handling ◽

Online Database ◽

Genetic Studies ◽

Wide Range ◽

Existing Data

Abstract Background: Today an unprecedented amount of genetic sequence data is stored in publicly available repositories. For decades now, mitochondrial DNA (mtDNA) has been the workhorse of genetic studies, and as a result, there is a large volume of mtDNA data available in these repositories for a wide range of species. Indeed, whilst whole genome sequencing is an exciting prospect for the future, for most non-model organisms, classical markers such as mtDNA remain widely used. By compiling existing data from multiple original studies, it is possible to build powerful new datasets capable of exploring many questions in ecology, evolution and conservation biology. One key question that these data can help inform is what happened in a species’ demographic past. However, compiling data in this manner is not trivial: there are many complexities associated with data extraction, data quality and data handling. Results: Here we present the mtDNAcombine package, a collection of tools developed to manage some of the major decisions associated with handling multi-study sequence data, with a particular focus on preparing sequence data for Bayesian skyline plot demographic reconstructions. Conclusions: There is now more genetic information available than ever before, and large meta-data sets offer great opportunities to explore new and exciting avenues of research. However, compiling multi-study datasets remains a technically challenging prospect. The mtDNAcombine package provides a pipeline to streamline the process of downloading, curating, and analysing sequence data, guiding the process of compiling data sets from the online database GenBank.

The Evolution of Life Modes in Stictidaceae, with Three Novel Taxa

Journal of Fungi ◽

10.3390/jof7020105 ◽

2021 ◽

Vol 7 (2) ◽

pp. 105

Author(s):

Vinodhini Thiyagaraja ◽

Robert Lücking ◽

Damien Ertz ◽

Samantha C. Karunarathna ◽

Dhanushka N. Wanasinghe ◽

...

Keyword(s):

New Species ◽

Large Scale ◽

Sequence Data ◽

Large Subunit ◽

Monte Carlo Sampling ◽

Small Subunit ◽

Sensu Stricto ◽

Internal Transcribed Spacers ◽

Character State ◽

Phenotypic Data

Ostropales sensu lato is a large group comprising both lichenized and non-lichenized fungi, with several lineages expressing optional lichenization where individuals of the same fungal species exhibit either saprotrophic or lichenized lifestyles depending on the substrate (bark or wood). Greatly variable phenotypic characteristics and large-scale phylogenies have led to frequent changes in the taxonomic circumscription of this order. Ostropales sensu lato is currently split into Graphidales, Gyalectales, Odontotrematales, Ostropales sensu stricto, and Thelenellales. Ostropales sensu stricto is now confined to the family Stictidaceae, which includes a large number of species that are poorly known, since they usually have small fruiting bodies that are rarely collected, and thus, their taxonomy remains partly unresolved. Here, we introduce a new genus Ostropomyces to accommodate a novel lineage related to Ostropa, which is composed of two new species, as well as a new species of Sphaeropezia, S. shangrilaensis. Maximum likelihood and Bayesian inference analyses of mitochondrial small subunit spacers (mtSSU), large subunit nuclear rDNA (LSU), and internal transcribed spacers (ITS) sequence data, together with phenotypic data documented by detailed morphological and anatomical analyses, support the taxonomic affinity of the new taxa in Stictidaceae. Ancestral character state analysis did not resolve the ancestral nutritional status of Stictidaceae with confidence using Bayes traits, but a saprotrophic ancestor was indicated as most likely in a Bayesian binary Markov Chain Monte Carlo sampling (MCMC) approach. Frequent switching in nutritional modes between lineages suggests that lifestyle transition played an important role in the evolution of this family.

Sequence data from isolated lichen-associated melanized fungi enhance delimitation of two new lineages within Chaetothyriomycetidae

Mycological Progress ◽

10.1007/s11557-021-01706-8 ◽

2021 ◽

Vol 20 (7) ◽

pp. 911-927

Author(s):

Lucia Muggia ◽

Yu Quan ◽

Cécile Gueidan ◽

Abdullah M. S. Al-Hatmi ◽

Martin Grube ◽

...

Keyword(s):

Sequence Data ◽

Single Species ◽

Sister Group ◽

Asexual Propagation ◽

Dna Sequence Data ◽

Wide Range ◽

The Family ◽

Rock Inhabiting Fungi ◽

Stable Habitat

Abstract Lichen thalli provide a long-lived and stable habitat for colonization by a wide range of microorganisms. Increased interest in these lichen-associated microbial communities has revealed an impressive diversity of fungi, including several novel lineages which still await formal taxonomic recognition. Among these, members of the Eurotiomycetes and Dothideomycetes usually occur asymptomatically in the lichen thalli, even if they share ancestry with fungi that may be parasitic on their host. Mycelia of the isolates are characterized by melanized cell walls and the fungi display exclusively asexual propagation. Their taxonomic placement requires, therefore, the use of DNA sequence data. Here, we consider recently published sequence data from lichen-associated fungi and characterize and formally describe two new, individually monophyletic lineages at family, genus, and species levels. The Pleostigmataceae fam. nov. and Melanina gen. nov. both comprise rock-inhabiting fungi that associate with epilithic, crust-forming lichens in subalpine habitats. The phylogenetic placement and the monophyly of Pleostigmataceae lack statistical support, but the family was resolved as sister to the order Verrucariales. This family comprises the species Pleostigma alpinum sp. nov., P. frigidum sp. nov., P. jungermannicola, and P. lichenophilum sp. nov. The placement of the genus Melanina is supported as a lineage within the Chaetothyriales. To date, this genus comprises the single species M. gunde-cimermaniae sp. nov. and forms a sister group to a large lineage including Herpotrichiellaceae, Chaetothyriaceae, Cyphellophoraceae, and Trichomeriaceae. The new phylogenetic analysis of the subclass Chaetothyriomycetidae provides new insight into genus and family level delimitation and classification of this ecologically diverse group of fungi.

Nanopore sequencing technology and tools for genome assembly: computational analysis of the current state, bottlenecks and future directions

Briefings in Bioinformatics ◽

10.1093/bib/bby017 ◽

2018 ◽

Vol 20 (4) ◽

pp. 1542-1559 ◽

Cited By ~ 44

Author(s):

Damla Senol Cali ◽

Jeremie S Kim ◽

Saugata Ghose ◽

Can Alkan ◽

Onur Mutlu

Keyword(s):

Sequence Analysis ◽

Genome Assembly ◽

Sequence Data ◽

Error Rates ◽

Nanopore Sequencing ◽

Memory Usage ◽

Sequencing Technology ◽

Assembly Pipeline ◽

And Performance ◽

Polishing Tool

Abstract Nanopore sequencing technology has the potential to render other sequencing technologies obsolete with its ability to generate long reads and provide portability. However, high error rates of the technology pose a challenge while generating accurate genome assemblies. The tools used for nanopore sequence analysis are of critical importance, as they should overcome the high error rates of the technology. Our goal in this work is to comprehensively analyze current publicly available tools for nanopore sequence analysis to understand their advantages, disadvantages and performance bottlenecks. It is important to understand where the current tools do not perform well to develop better tools. To this end, we (1) analyze the multiple steps and the associated tools in the genome assembly pipeline using nanopore sequence data, and (2) provide guidelines for determining the appropriate tools for each step. Based on our analyses, we make four key observations: (1) the choice of the tool for basecalling plays a critical role in overcoming the high error rates of nanopore sequencing technology. (2) Read-to-read overlap finding tools, GraphMap and Minimap, perform similarly in terms of accuracy. However, Minimap has a lower memory usage, and it is faster than GraphMap. (3) There is a trade-off between accuracy and performance when deciding on the appropriate tool for the assembly step. The fast but less accurate assembler Miniasm can be used for quick initial assembly, and further polishing can be applied on top of it to increase the accuracy, which leads to faster overall assembly. (4) The state-of-the-art polishing tool, Racon, generates high-quality consensus sequences while providing a significant speedup over another polishing tool, Nanopolish. We analyze various combinations of different tools and expose the trade-offs between accuracy, performance, memory usage and scalability. 
We conclude that our observations can guide researchers and practitioners in making conscious and effective choices for each step of the genome assembly pipeline using nanopore sequence data. Also, with the help of bottlenecks we have found, developers can improve the current tools or build new ones that are both accurate and fast, to overcome the high error rates of the nanopore sequencing technology.

Congenital adrenal hyperplasia: molecular mechanisms resulting in 21-hydroxylase deficiency

Acta Endocrinologica ◽

10.1530/acta.0.112s315 ◽

1986 ◽

Vol 113 (4_Suppl) ◽

pp. S315-S320 ◽

Cited By ~ 7

Author(s):

Patricia A. Donohoue ◽

Cornelis Van Dop ◽

Nicholas Jospe ◽

Claude J. Migeon

Keyword(s):

Congenital Adrenal Hyperplasia ◽

Molecular Mechanisms ◽

Sequence Data ◽

Phenotypic Expression ◽

Restriction Pattern ◽

Class Iii ◽

Adrenal Hyperplasia ◽

Hydroxylase Deficiency ◽

Wide Range ◽

21 Hydroxylase Deficiency

Abstract 21-Hydroxylase deficiency resulting in congenital adrenal hyperplasia (CAH) is a HLA-linked autosomal recessive disorder that has a wide range of phenotypic expression. Two homologous 21-hydroxylase genes (21-OHA and 21-OHB) occur within the Class III region of the major histocompatibility complex, but only one (21-OHB) appears to function in adrenal steroidogenesis. Our restriction maps, and initial sequence data from White et al. (Pediatr Res 20:274A (1986)) for the two human 21-OH genes reveal a high degree of homology between these genes and a reading frame shift mutation in the 21-OHA gene respectively. Among fourteen control subjects, the intragenic restriction patterns of the 21-OHA and 21-OHB genes are invariant. The few restriction fragment length polymorphisms (RFLPs) found in some controls result from polymorphic restriction sites outside the 21-OH genes. In patients with CAH, several different mechanisms for mutation of the 21-OHB gene have been described: 1) deletion of the unique sequences of the 21-OHB gene, 2) conversion of the unique sequences of the 21-OHB gene to those of 21-OHA, and 3) mutations of 21-OHB which do not result in a detectable alteration of restriction pattern (e.g., point mutations). Duplication of the 21-OHA gene has been found in some patients with attenuated CAH; however, the significance of this finding remains unclear.

A50 Whole-genome sequencing of African swine fever isolates from Sardinia

Virus Evolution ◽

10.1093/ve/vez002.049 ◽

2019 ◽

Vol 5 (Supplement_1) ◽

Author(s):

C Torresi ◽

F Granberg ◽

L Bertolotti ◽

A Oggiano ◽

B Colitti ◽

...

Keyword(s):

Amino Acid ◽

Sequence Data ◽

Temporal Distribution ◽

African Swine Fever ◽

Amino Acid Identity ◽

Genotype I ◽

Acid Identity ◽

Wide Range ◽

Intergenic Sequences ◽

Genes Encoding

Abstract In order to assess the molecular epidemiology of African swine fever (ASF) in Sardinia, we analyzed a wide range of isolates from wild and domestic pigs over a 31-year period (1978–2009) by genotyping sequence data from the genes encoding the p54 and the p72 proteins and the CVR. On this basis, the analysis of the B602L gene revealed a minor difference, placing the Sardinian isolates into two clusters according to their temporal distribution. As an extension of this study, in order to achieve a higher level of discrimination, three further variable genome regions, namely p30, CD2v, and I73R/I329L, of a large number of isolates collected from outbreaks in the years 2002–14 have been investigated. Sequence analysis of the CD2v region revealed a temporal subdivision of the viruses into two subgroups. These data, together with those from the B602L gene analysis, demonstrated that the viruses circulating in Sardinia belong to p72/genotype I, but since 1990 have undergone minor genetic variations in respect to its ancestor, thus making it impossible to trace isolates, enabling a more accurate assessment of the origin of outbreaks, and extending knowledge of virus evolution. To solve this problem, we have sequenced and annotated the complete genome of nine ASF isolates collected in Sardinia between 1978 and 2012. This was achieved using sequence data determined by next-generation sequencing. The results showed a very high identity with range of nucleotide similarity among isolates of 99.5 per cent to 99.9 per cent. The ASF virus (ASFV) genomes were composed of terminal inverted repeats and conserved and non-conserved ORFs. Among the conserved ORFs, B385R, H339R, and O61R-p12 showed 100 per cent amino acid identity. The same was true for the hypervariable ORFs, with regard to X69R, DP96R, DP60R, EP153R, B407L, I10L, and L60L genes. 
The EP402R and B602L genes showed, as expected, an amino acid identity range of 98.5 per cent to 100 per cent and 91 per cent to 100 per cent, respectively. In addition, all of the isolates displayed variable intergenic sequences. As a whole, the results from our studies confirmed a remarkable genetic stability of the ASFV/p72 genotype I viruses circulating in Sardinia.

The fractured landscape of RNA-seq alignment: The default in our STARs

10.1101/220681 ◽

2017 ◽

Cited By ~ 1

Author(s):

Sara Ballouz ◽

Alexander Dobin ◽

Thomas Gingeras ◽

Jesse Gillis

Keyword(s):

Expression Profile ◽

Rna Seq ◽

Model Data ◽

Mhc Genes ◽

Wide Range ◽

Biological Discovery ◽

Biological Performance ◽

Expression Quantification

Abstract Many tools are available for RNA-seq alignment and expression quantification, with comparative value being hard to establish. Benchmarking assessments often highlight methods’ good performance, but are focused on either model data or fail to explain variation in performance. This leaves us to ask, what is the most meaningful way to assess different alignment choices? And importantly, where is there room for progress? In this work, we explore the answers to these two questions by performing an exhaustive assessment of the STAR aligner. We assess STAR’s performance across a range of alignment parameters using common metrics, and then on biologically focused tasks. We find technical metrics such as fraction mapping or expression profile correlation to be uninformative, capturing properties unlikely to have any role in biological discovery. Surprisingly, we find that changes in alignment parameters within a wide range have little impact on both technical and biological performance. Yet, when performance finally does break, it happens in difficult regions, such as X-Y paralogs and MHC genes. We believe improved reporting by developers will help establish where results are likely to be robust or fragile, providing a better baseline to establish where methodological progress can still occur.

SeRenDIP-CE: Sequence-based Interface Prediction for Conformational Epitopes

10.1101/2020.11.19.390500 ◽

2020 ◽

Author(s):

Qingzhen Hou ◽

Bas Stringer ◽

Katharina Waury ◽

Henriette Capel ◽

Reza Haydarlou ◽

...

Keyword(s):

Rna Binding ◽

Sequence Data ◽

Conformational Epitope ◽

Test Set ◽

Epitope Region ◽

Protein Protein Interaction ◽

Wide Range ◽

Single Antigen ◽

Wet Lab ◽

Antigen Structure

Abstract Motivation: Antibodies play an important role in clinical research and biotechnology, with their specificity determined by the interaction with the antigen’s epitope region, a special type of protein-protein interaction (PPI) interface. The ubiquitous availability of sequence data allows us to predict epitopes from sequence in order to focus time-consuming wet-lab experiments on the most promising epitope regions. Here, we extend our previously developed sequence-based predictors for homodimer and heterodimer PPI interfaces to predict epitope residues that have the potential to bind an antibody. Results: We collected and curated a high quality epitope dataset from the SAbDaB database. Our generic PPI heterodimer predictor obtained an AUC-ROC of 0.666 when evaluated on the epitope test set. We then trained a random forest model specifically on the epitope dataset, reaching AUC 0.694. Further training on the combined heterodimer and epitope datasets improves our final predictor to AUC 0.703 on the epitope test set. This is better than the best state-of-the-art sequence-based epitope predictor BepiPred-2.0. On one solved antibody-antigen structure of the COVID19 virus spike RNA binding domain, our predictor reaches AUC 0.778. We added the SeRenDIP-CE Conformational Epitope predictors to our webserver, which is simple to use and only requires a single antigen sequence as input, which will help make the method immediately applicable in a wide range of biomedical and biomolecular research. Availability: Webserver, source code and datasets are available at www.ibi.vu.nl/programs/serendipwww/ [email protected]
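The predictors above are compared by AUC-ROC. As a reminder of what that number measures, here is a minimal sketch that computes it as the probability that a randomly chosen positive (epitope) residue is scored above a randomly chosen negative one, with ties counting one half. The function name `auc_roc` and the toy labels and scores are hypothetical; this is not the SeRenDIP-CE evaluation code.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC-ROC via pairwise comparison of positive and negative scores.

    labels: 1 for positive (e.g. epitope residue), 0 for negative.
    scores: predicted scores, higher meaning "more likely positive".
    Quadratic in the number of residues; fine for a small illustration.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have equal length")
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    toy_labels = [1, 0, 1, 0]            # hypothetical residue labels
    toy_scores = [0.9, 0.8, 0.4, 0.1]    # hypothetical predictor scores
    print(auc_roc(toy_labels, toy_scores))
```

On this scale 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the context for reading values such as 0.666 or 0.703.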

Rapid antibiotic-resistance predictions from genome sequence data for Staphylococcus aureus and Mycobacterium tuberculosis

Nature Communications ◽

10.1038/ncomms10063 ◽

2015 ◽

Vol 6 (1) ◽

Cited By ~ 281

Author(s):

Phelim Bradley ◽

N. Claire Gordon ◽

Timothy M. Walker ◽

Laura Dunn ◽

Simon Heys ◽

...

Keyword(s):

Staphylococcus Aureus ◽

Mycobacterium Tuberculosis ◽

Sequence Data ◽

Error Rates ◽

Graph Representation ◽

Clinical Samples ◽

Resistant Bacteria ◽

Independent Validation ◽

Validation Set ◽

Sensitivity Specificity

Abstract The rise of antibiotic-resistant bacteria has led to an urgent need for rapid detection of drug resistance in clinical samples, and improvements in global surveillance. Here we show how de Bruijn graph representation of bacterial diversity can be used to identify species and resistance profiles of clinical isolates. We implement this method for Staphylococcus aureus and Mycobacterium tuberculosis in a software package (‘Mykrobe predictor’) that takes raw sequence data as input, and generates a clinician-friendly report within 3 minutes on a laptop. For S. aureus, the error rates of our method are comparable to gold-standard phenotypic methods, with sensitivity/specificity of 99.1%/99.6% across 12 antibiotics (using an independent validation set, n=470). For M. tuberculosis, our method predicts resistance with sensitivity/specificity of 82.6%/98.5% (independent validation set, n=1,609); sensitivity is lower here, probably because of limited understanding of the underlying genetic mechanisms. We give evidence that minor alleles improve detection of extremely drug-resistant strains, and demonstrate feasibility of the use of emerging single-molecule nanopore sequencing techniques for these purposes.
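The validation results above are reported as sensitivity/specificity pairs. The sketch below shows how those two quantities follow from confusion-matrix counts, taking the phenotypic result as the reference. The function name and the counts in the example are hypothetical and are not taken from the paper's validation sets.

```python
from typing import Tuple


def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> Tuple[float, float]:
    """Return (sensitivity, specificity) from confusion-matrix counts.

    tp: resistant isolates predicted resistant
    fn: resistant isolates predicted susceptible
    tn: susceptible isolates predicted susceptible
    fp: susceptible isolates predicted resistant
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one resistant and one susceptible isolate")
    sensitivity = tp / (tp + fn)   # fraction of resistant isolates detected
    specificity = tn / (tn + fp)   # fraction of susceptible isolates cleared
    return sensitivity, specificity


if __name__ == "__main__":
    # Hypothetical counts for a single antibiotic.
    sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f}")
```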

Microbial Dextran-Hydrolyzing Enzymes: Fundamentals and Applications

Microbiology and Molecular Biology Reviews ◽

10.1128/mmbr.69.2.306-325.2005 ◽

2005 ◽

Vol 69 (2) ◽

pp. 306-325 ◽

Cited By ~ 128

Author(s):

Elvira Khalikova ◽

Petri Susi ◽

Timo Korpela

Keyword(s):

Sequence Data ◽

Three Dimensional ◽

Side Reaction ◽

Sequence Information ◽

New Classification ◽

Wide Range ◽

Hydrolyzing Enzymes ◽

Complex Polymer ◽

Sequence Similarities ◽

Primary Sequence Data

SUMMARY Dextran is a chemically and physically complex polymer, breakdown of which is carried out by a variety of endo- and exodextranases. Enzymes in many groups can be classified as dextranases according to function: such enzymes include dextranhydrolases, glucodextranases, exoisomaltohydrolases, exoisomaltotriohydrolases, and branched-dextran exo-1,2-α-glucosidases. Cycloisomalto-oligosaccharide glucanotransferase does not formally belong to the dextranases even though its side reaction produces hydrolyzed dextrans. A new classification system for glycosylhydrolases and glycosyltransferases, which is based on amino acid sequence similarities, divides the dextranases into five families. However, this classification is still incomplete since sequence information is missing for many of the enzymes that have been biochemically characterized as dextranases. Dextran-degrading enzymes have been isolated from a wide range of microorganisms. The major characteristics of these enzymes, the methods for analyzing their activities and biological roles, analysis of primary sequence data, and three-dimensional structures of dextranases have been dealt with in this review. Dextranases are promising for future use in various scientific and biotechnological applications.
