What is an archaeon and are the Archaea really unique?

PeerJ ◽

10.7717/peerj.5770 ◽

2018 ◽

Vol 6 ◽

pp. e5770 ◽

Cited By ~ 6

Author(s):

Ajith Harish

Keyword(s):

Sequence Data ◽

Deep Structure ◽

Distribution Patterns ◽

Primary Sequence ◽

Sequence Alignments ◽

Evolutionary Transitions ◽

Genomic Signatures ◽

Substitution Mutations ◽

Major Branch ◽

Core Genes

The recognition of the group Archaea as a major branch of the tree of life (ToL) prompted a new view of the evolution of biodiversity. The genomic representation of archaeal biodiversity has since significantly increased. In addition, advances in phylogenetic modeling of multi-locus datasets have resolved many recalcitrant branches of the ToL. Despite the technical advances and an expanded taxonomic representation, two important aspects of the origins and evolution of the Archaea remain controversial, even as we celebrate the 40th anniversary of the monumental discovery. These issues concern (i) the uniqueness (monophyly) of the Archaea, and (ii) the evolutionary relationships of the Archaea to the Bacteria and the Eukarya; both of these are relevant to the deep structure of the ToL. To explore the causes for this persistent ambiguity, I examine multiple datasets and different phylogenetic approaches that support contradicting conclusions. I find that the uncertainty is primarily due to a scarcity of information in standard datasets (universal core-genes datasets) to reliably resolve the conflicts. These conflicts can be resolved efficiently by comparing patterns of variation in the distribution of functional genomic signatures, which are less diffuse than patterns of primary sequence variation. Relatively lower heterogeneity in distribution patterns minimizes uncertainties and supports statistically robust phylogenetic inferences, especially of the earliest divergences of life. This case study further highlights the limitations of primary sequence data in resolving difficult phylogenetic problems, and raises questions about evolutionary inferences drawn from the analyses of sequence alignments of a small set of core genes. In particular, the findings of this study corroborate the growing consensus that reversible substitution mutations may not be optimal phylogenetic markers for resolving early divergences in the ToL, nor for determining the polarity of evolutionary transitions across the ToL.

What is an archaeon and are the Archaea really unique?

10.1101/256263 ◽

2018 ◽

Author(s):

Ajith Harish

Keyword(s):

Sequence Data ◽

Deep Structure ◽

Distribution Patterns ◽

Primary Sequence ◽

Genomic Signatures ◽

Multiple Datasets ◽

Technical Advances ◽

Small Set ◽

Major Branch ◽

Core Genes

Abstract: The recognition of the group Archaea as a major branch of the Tree of Life (ToL) prompted a new view of the evolution of biodiversity. The genomic representation of archaeal biodiversity has since significantly increased. In addition, advances in phylogenetic modeling of multi-locus datasets have resolved many recalcitrant branches of the ToL. Despite the technical advances and an expanded taxonomic representation, two important aspects of the origins and evolution of the Archaea remain controversial, even as we celebrate the 40th anniversary of the monumental discovery. These issues concern (i) the uniqueness (monophyly) of the Archaea, and (ii) the evolutionary relationships of the Archaea to the Bacteria and the Eukarya; both of these are relevant to the deep structure of the ToL. Here, to explore the causes for this persistent ambiguity, I examine multiple datasets that support contradicting conclusions. Results indicate that the uncertainty is primarily due to a scarcity of information in standard datasets (the core genes datasets) to reliably resolve the conflicts. These conflicts can be resolved efficiently by comparing patterns of variation in the distribution of functional genomic signatures, which are less diffuse than patterns of primary sequence variation. Relatively lower heterogeneity in distribution patterns minimizes uncertainties, which supports statistically robust phylogenetic inferences, especially of the earliest divergences of life. This case study further highlights the limits of primary sequence data in resolving difficult phylogenetic problems and casts doubt on evolutionary inferences drawn solely from the analyses of a small set of core genes.

Neisseria meningitidis has acquired sequences within the capsule locus by horizontal genetic transfer

Wellcome Open Research ◽

10.12688/wellcomeopenres.15333.1 ◽

2019 ◽

Vol 4 ◽

pp. 99

Author(s):

Marianne E. A. Clemence ◽

Odile B. Harrison ◽

Martin C. J. Maiden

Keyword(s):

Neisseria Meningitidis ◽

Sequence Data ◽

Whole Genome Sequence ◽

Accession Number ◽

Sequence Alignments ◽

En Bloc ◽

Genetic Transfer ◽

Diverse Range ◽

Homologous Sequences ◽

Capsule Locus

Background: Expression of a capsule from one of serogroups A, B, C, W, X or Y is usually required for Neisseria meningitidis (Nme) to cause invasive meningococcal disease. The capsule is encoded by the capsule locus, cps, which is proposed to have been acquired by a formerly capsule null organism by horizontal genetic transfer (HGT) from another species. Following identification of putative capsule genes in non-pathogenic Neisseria species, this hypothesis is re-examined. Methods: Whole genome sequence data from Neisseria species, including Nme genomes from a diverse range of clonal complexes and capsule genogroups, and non-Neisseria species, were obtained from PubMLST and GenBank. Sequence alignments of genes from the meningococcal cps, and predicted orthologues in other species, were analysed using Neighbor-nets, BOOTSCANing and maximum likelihood phylogenies. Results: The meningococcal cps was highly mosaic within regions B, C and D. A subset of sequences within regions B and C were phylogenetically nested within homologous sequences belonging to N. subflava, consistent with an HGT event in which N. subflava was the donor. In the cps of 23/39 isolates, the two copies of region D were highly divergent, with rfbABC’ sequences being more closely related to predicted orthologues in the proposed species N. weixii (GenBank accession number CP023429.1) than the same genes in Nme isolates lacking a capsule. There was also evidence of mosaicism in the rfbABC’ sequences of the remaining 16 isolates, as well as rfbABC from many isolates. Conclusions: Data are consistent with the en bloc acquisition of cps in meningococci from N. subflava, followed by further recombination events with other Neisseria species. Nevertheless, the data cannot refute an alternative model, in which native meningococcal capsule existed prior to undergoing HGT with N. subflava and other species. Within-genus recombination events may have given rise to the diversity of meningococcal capsule serogroups.

VisFeature: a stand-alone program for visualizing and analyzing statistical features of biological sequences

Bioinformatics ◽

10.1093/bioinformatics/btz689 ◽

2019 ◽

Cited By ~ 3

Author(s):

Jun Wang ◽

Pu-Feng Du ◽

Xin-Yu Xue ◽

Guang-Ping Li ◽

Yuan-Ke Zhou ◽

...

Keyword(s):

Sequence Data ◽

Software Tool ◽

Data Retrieval ◽

Supplementary Information ◽

Statistical Features ◽

Biological Sequence ◽

Sequence Alignments ◽

Multiple Sequence ◽

Source Codes ◽

Multiple Sequence Alignments

Summary: Many efforts have been made in developing bioinformatics algorithms to predict functional attributes of genes and proteins from their primary sequences. One challenge in this process is to intuitively analyze and understand the statistical features that have been selected by heuristic or iterative methods. In this paper, we developed VisFeature, which aims to be a helpful software tool that allows users to intuitively visualize and analyze statistical features of all types of biological sequence, including DNA, RNA and proteins. VisFeature also integrates sequence data retrieval, multiple sequence alignments and statistical feature generation functions. Availability and implementation: VisFeature is a desktop application that is implemented using JavaScript/Electron and R. The source code of VisFeature is freely accessible from the GitHub repository (https://github.com/wangjun1996/VisFeature). The binary release, which includes an example dataset, can be freely downloaded from the same GitHub repository (https://github.com/wangjun1996/VisFeature/releases). Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

Prediction of mutation effects using a deep temporal convolutional network

Bioinformatics ◽

10.1093/bioinformatics/btz873 ◽

2019 ◽

Vol 36 (7) ◽

pp. 2047-2052 ◽

Cited By ~ 1

Author(s):

Ha Young Kim ◽

Dongsup Kim

Keyword(s):

Latent Variable ◽

Sequence Data ◽

Generative Model ◽

Supplementary Information ◽

Biological Research ◽

Sequence Alignments ◽

Variable Model ◽

Convolutional Network ◽

Direct Optimization ◽

Multiple Sequence

Motivation: Accurate prediction of the effects of genetic variation is a major goal in biological research. Towards this goal, numerous machine learning models have been developed to learn information from evolutionary sequence data. The most effective method so far is a deep generative model based on the variational autoencoder (VAE) that models the distributions using a latent variable. In this study, we propose a deep autoregressive generative model named mutationTCN, which employs dilated causal convolutions and an attention mechanism for the modeling of inter-residue correlations in a biological sequence. Results: We show that this model is competitive with the VAE model when tested against a set of 42 high-throughput mutation scan experiments, with a mean improvement in Spearman rank correlation of ∼0.023. In particular, our model can more efficiently capture information from multiple sequence alignments with a lower effective number of sequences, such as in viral sequence families, compared with the latent variable model. Also, we extend this architecture to a semi-supervised learning framework, which shows high prediction accuracy. We show that our model enables a direct optimization of the data likelihood and allows for a simple and stable training process. Availability and implementation: Source code is available at https://github.com/ha01994/mutationTCN. Supplementary information: Supplementary data are available at Bioinformatics online.
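
The abstract names dilated causal convolutions as a building block but does not spell out how they work. The following is a minimal sketch of the general technique only, not the mutationTCN code: the function names `dilated_causal_conv1d` and `receptive_field`, the zero left-padding, and the toy weights are all assumptions made for illustration.

```python
from typing import List, Sequence


def dilated_causal_conv1d(
    x: Sequence[float], weights: Sequence[float], dilation: int
) -> List[float]:
    """Hypothetical helper: 1-D causal convolution with a dilation factor.

    Output position t is a weighted sum of x[t], x[t - dilation],
    x[t - 2 * dilation], ... so it never looks at later positions.
    Positions before the start of the sequence count as zero (left padding).
    """
    out = []
    for t in range(len(x)):
        total = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # step backwards through the sequence
            if idx >= 0:
                total += w * x[idx]
        out.append(total)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Input positions visible to one output after stacking the given layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]
    # With weights [1, 1] and dilation 2, each output is x[t] + x[t - 2].
    print(dilated_causal_conv1d(signal, [1.0, 1.0], dilation=2))
    # Doubling the dilation per layer grows the context quickly.
    print(receptive_field(kernel_size=2, dilations=[1, 2, 4]))
```

Stacking layers with growing dilations is what lets a model of this kind relate distant residues without a very wide kernel; the real model also adds attention and learned weights, which this sketch leaves out.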

Genomic signatures of evolutionary transitions from solitary to group living

Science ◽

10.1126/science.aaa4788 ◽

2015 ◽

Vol 348 (6239) ◽

pp. 1139-1143 ◽

Cited By ~ 214

Author(s):

K. M. Kapheim ◽

H. Pan ◽

C. Li ◽

S. L. Salzberg ◽

D. Puiu ◽

...

Keyword(s):

Group Living ◽

Evolutionary Transitions ◽

Genomic Signatures

Validation and development of COI metabarcoding primers for freshwater macroinvertebrate bioassessment

10.7287/peerj.preprints.2044v5 ◽

2017 ◽

Author(s):

Vasco Elbrecht ◽

Florian Leese

Keyword(s):

In Silico ◽

Sequence Data ◽

Human Impacts ◽

Biodiversity Loss ◽

Freshwater Ecosystems ◽

Amplification Efficiency ◽

Mock Community ◽

Sequence Alignments ◽

Freshwater Invertebrate ◽

Dna Metabarcoding

A central challenge in the present era of biodiversity loss is to assess and manage human impacts on freshwater ecosystems. Macroinvertebrates are an important group for bioassessment as many taxa show specific responses to environmental conditions. However, generating accurate macroinvertebrate inventories based on larval morphology is difficult and error-prone. Here, DNA metabarcoding provides new opportunities. Its potential to accurately identify invertebrates in bulk samples to the species level has been demonstrated in several case studies. However, DNA-based identification is often limited by primer bias, potentially leading to taxa in the sample remaining undetected. Thus, the success of DNA metabarcoding as an emerging technique for bioassessment critically relies on carefully evaluating primers. We used the R package PrimerMiner to obtain and process cytochrome c oxidase I (COI) sequence data for the 15 most globally relevant freshwater invertebrate groups for stream assessment. Using these sequence alignments, we developed four primer combinations optimized for freshwater macrozoobenthos. All primers were evaluated by sequencing ten mock community samples, each consisting of 52 freshwater invertebrate taxa. Additionally, popular metabarcoding primers from the literature and the developed primers were tested in silico against the 15 relevant invertebrate groups. The developed primers varied in amplification efficiency and the number of detected taxa, yet all detected more taxa than standard ‘Folmer’ barcoding primers. Two new primer combinations showed more consistent amplification than a previously tested ribosomal marker (16S) and detected all 42 insect taxa present in the mock community samples. In silico evaluation revealed critical design flaws in some commonly used primers from the literature. We demonstrate a reliable strategy to develop optimized primers using the tool PrimerMiner. The developed primers detected almost all taxa present in the mock samples, and we argue that high base degeneracy is necessary to decrease primer bias, as confirmed by experimental results and in silico primer evaluation. We further demonstrate that some primers currently used in metabarcoding studies may not be suitable for amplification of freshwater macroinvertebrates. Therefore, careful primer evaluation and more region- or ecosystem-specific primers are needed before DNA metabarcoding can be used for routine bioassessment of freshwater ecosystems.
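
To make "base degeneracy" and in silico primer evaluation concrete, here is a small Python sketch (the study itself used the R package PrimerMiner, whose interface is not reproduced here). It counts how many concrete oligos a degenerate primer expands to and how many positions of a binding site it fails to match. The function names and the toy primer strings are hypothetical, and the table covers only standard IUPAC nucleotide codes (inosine is not handled).

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degeneracy(primer: str) -> int:
    """Number of distinct unambiguous sequences a degenerate primer encodes."""
    count = 1
    for base in primer.upper():
        count *= len(IUPAC[base])
    return count


def mismatches(primer: str, site: str) -> int:
    """Positions where the template base is not covered by the primer code.

    Both strings are assumed to be in the same orientation and of equal length.
    """
    if len(primer) != len(site):
        raise ValueError("primer and binding site must have equal length")
    return sum(
        1 for p, s in zip(primer.upper(), site.upper()) if s not in IUPAC[p]
    )


if __name__ == "__main__":
    toy_primer = "GGRACNGG"  # made-up example, not a primer from the study
    print(degeneracy(toy_primer))            # 2 (R) * 4 (N) = 8 variants
    print(mismatches(toy_primer, "GGAACTGG"))
    print(mismatches(toy_primer, "GGCACTGG"))
```

A more degenerate primer covers more template variants (fewer mismatches across taxa) at the cost of a larger oligo pool, which is the trade-off the abstract argues in favour of.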

Revealing evolutionary constraints on proteins through sequence analysis

10.1101/397521 ◽

2018 ◽

Author(s):

Shou-Wen Wang ◽

Anne-Florence Bitbol ◽

Ned S. Wingreen

Keyword(s):

Amino Acids ◽

Covariance Matrix ◽

Sequence Data ◽

Amino Acid Sequences ◽

Elastic Network Model ◽

Sequence Alignments ◽

Cellular Processes ◽

Large Numbers ◽

Protein Properties ◽

Selected Traits

Abstract: Statistical analysis of alignments of large numbers of protein sequences has revealed “sectors” of collectively coevolving amino acids in several protein families. Here, we show that selection acting on any functional property of a protein, represented by an additive trait, can give rise to such a sector. As an illustration of a selected trait, we consider the elastic energy of an important conformational change within an elastic network model, and we show that selection acting on this energy leads to correlations among residues. For this concrete example and more generally, we demonstrate that the main signature of functional sectors lies in the small-eigenvalue modes of the covariance matrix of the selected sequences. However, secondary signatures of these functional sectors also exist in the extensively studied large-eigenvalue modes. Our simple, general model leads us to propose a principled method to identify functional sectors, along with the magnitudes of mutational effects, from sequence data. We further demonstrate the robustness of these functional sectors to various forms of selection, and the robustness of our approach to the identification of multiple selected traits. Author summary: Proteins play crucial parts in all cellular processes, and their functions are encoded in their amino-acid sequences. Recently, statistical analyses of protein sequence alignments have demonstrated the existence of “sectors” of collectively correlated amino acids. What is the origin of these sectors? Here, we propose a simple underlying origin of protein sectors: they can arise from selection acting on any collective protein property. We find that the main signature of these functional sectors lies in the low-eigenvalue modes of the covariance matrix of the selected sequences. A better understanding of protein sectors will make it possible to discern collective protein properties directly from sequences, as well as to design new functional sequences, with far-reaching applications in synthetic biology.
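
Why selection on an additive trait should show up in the small-eigenvalue modes can be seen from a short calculation. The notation below is an assumption introduced for illustration and is not taken from the abstract: each sequence is a vector $\sigma$ over $L$ sites, $\Delta_i$ is the effect of site $i$ on the trait, and averages run over the selected sequences.

```latex
% Assumed notation: \sigma = sequence vector, \Delta = per-site effects on the
% trait, \langle \cdot \rangle = average over the selected sequences.
\begin{align}
T(\sigma) &= \sum_{i=1}^{L} \Delta_i \sigma_i
  && \text{(additive trait)} \\
C_{ij} &= \langle \sigma_i \sigma_j \rangle
        - \langle \sigma_i \rangle \langle \sigma_j \rangle
  && \text{(covariance matrix)} \\
\operatorname{Var}(T) &= \sum_{i,j} \Delta_i C_{ij} \Delta_j
        = \Delta^{\top} C \Delta
  && \text{(trait variance)} \\
C v^{(k)} &= \lambda_k v^{(k)}, \qquad
\lambda_{\min} \le \frac{\Delta^{\top} C \Delta}{\Delta^{\top} \Delta}
  && \text{(Rayleigh bound)}
\end{align}
```

If selection confines the trait to a narrow range, $\operatorname{Var}(T)$ is small, so by the last line at least one eigenvalue of $C$ must be small; this is consistent with the abstract's statement that the main signature sits in the small-eigenvalue modes. How closely the corresponding eigenvector tracks $\Delta$ depends on details of the model that the abstract does not give.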

The Unreasonable Effectiveness of Convolutional Neural Networks in Population Genetic Inference

10.1101/336073 ◽

2018 ◽

Cited By ~ 3

Author(s):

Lex Flagel ◽

Yaniv Brandvain ◽

Daniel R. Schrider

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Population Genetic ◽

Sequence Data ◽

Input Sequence ◽

Evolutionary Model ◽

Sequence Alignments ◽

Likelihood Approach ◽

Population Genetic Inference ◽

Genetic Inference

Abstract: Population-scale genomic datasets have given researchers incredible amounts of information from which to infer evolutionary histories. Concomitant with this flood of data, theoretical and methodological advances have sought to extract information from genomic sequences to infer demographic events such as population size changes and gene flow among closely related populations/species, construct recombination maps, and uncover loci underlying recent adaptation. To date most methods make use of only one or a few summaries of the input sequences and therefore ignore potentially useful information encoded in the data. The most sophisticated of these approaches involve likelihood calculations, which require theoretical advances for each new problem, and often focus on a single aspect of the data (e.g. only allele frequency information) in the interest of mathematical and computational tractability. Directly interrogating the entirety of the input sequence data in a likelihood-free manner would thus offer a fruitful alternative. Here we accomplish this by representing DNA sequence alignments as images and using a class of deep learning methods called convolutional neural networks (CNNs) to make population genetic inferences from these images. We apply CNNs to a number of evolutionary questions and find that they frequently match or exceed the accuracy of current methods. Importantly, we show that CNNs perform accurate evolutionary model selection and parameter estimation, even on problems that have not received detailed theoretical treatments. Thus, when applied to population genetic alignments, CNNs are capable of outperforming expert-derived statistical methods, and offer a new path forward in cases where no likelihood approach exists.
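
The abstract does not say how an alignment becomes an image. One plausible encoding, shown here purely as an assumption, is a 0/1 matrix with one row per sequence and one column per site, where 0 marks the most common base at that site and 1 marks anything else. The function names are hypothetical and the authors' actual preprocessing may differ.

```python
from collections import Counter
from typing import List, Sequence


def alignment_to_matrix(seqs: Sequence[str]) -> List[List[int]]:
    """Encode aligned sequences as a binary matrix (rows = sequences).

    0 = the most common base in that column, 1 = any other base.
    """
    if not seqs:
        return []
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same aligned length")
    # Most common base per column (ties fall to the base seen first).
    majors = [Counter(col).most_common(1)[0][0] for col in zip(*seqs)]
    return [[0 if base == major else 1 for base, major in zip(s, majors)]
            for s in seqs]


def segregating_columns(matrix: Sequence[Sequence[int]]) -> List[int]:
    """Indices of columns that vary, i.e. contain at least one 1."""
    if not matrix:
        return []
    return [j for j in range(len(matrix[0])) if any(row[j] for row in matrix)]


if __name__ == "__main__":
    toy_alignment = ["ACGT", "ACGA", "TCGT"]  # made-up example
    image = alignment_to_matrix(toy_alignment)
    for row in image:
        print(row)
    print(segregating_columns(image))
```

A matrix like this can be passed to a convolutional network in the same way as a single-channel image; choices such as row ordering and padding matter in practice and are left out of this sketch.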

Prowler: A novel trimming algorithm for Oxford Nanopore sequence data

10.1101/2021.05.09.443332 ◽

2021 ◽

Author(s):

Simon Lee ◽

Loan T. Nguyen ◽

Ben J. Hayes ◽

Elizabeth M Ross

Keyword(s):

Sequence Data ◽

Error Rates ◽

Read Length ◽

Sequencing Analysis ◽

Sequence Alignments ◽

Lower Error ◽

Oxford Nanopore ◽

High Quality Sequence ◽

Dna Sequencing Analysis ◽

Window Approach

Motivation: Quality control (QC) tools are critical in DNA sequencing analysis because they increase the accuracy of sequence alignments and thus the reliability of results. Oxford Nanopore Technologies (ONT) QC is currently rudimentary, generally based on whole-read average quality. This results in discarding reads that contain regions of high-quality sequence. Here we propose Prowler, a multi-window approach inspired by algorithms used to QC short-read data. Importantly, we retain the phase and read length information by optionally replacing trimmed sections with Ns. Results: Prowler was applied to mammalian and bacterial datasets to assess effects on alignment and assembly, respectively. Compared to Nanofilt, alignments of data QCed with Prowler had lower error rates and more mapped reads. Assemblies of Prowler QCed data had a lower error rate than Nanofilt QCed data; however, this came at some cost to assembly contiguity. Availability and implementation: Prowler is implemented in Python and is available at: https://github.com/ProwlerForNanopore/ProwlerTrimmer Contact: [email protected]
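
The idea of window-based trimming with N-masking can be sketched in a few lines of Python. This is an illustration of the general approach described above, not Prowler's implementation: the function names, the fixed non-overlapping windows, and the default threshold are all assumptions.

```python
from typing import List, Sequence


def window_mask(
    seq: str, quals: Sequence[float], window: int = 4, min_mean_q: float = 7.0
) -> str:
    """Replace low-quality windows with Ns, keeping the read length intact.

    The read is cut into consecutive non-overlapping windows; a window is kept
    if its mean quality is at least min_mean_q, otherwise it is masked.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    pieces = []
    for start in range(0, len(seq), window):
        chunk = seq[start:start + window]
        chunk_q = quals[start:start + window]
        if sum(chunk_q) / len(chunk_q) >= min_mean_q:
            pieces.append(chunk)
        else:
            pieces.append("N" * len(chunk))  # mask, so positions are preserved
    return "".join(pieces)


def split_fragments(masked: str) -> List[str]:
    """Alternative to masking: return the retained stretches as separate reads."""
    return [frag for frag in masked.split("N") if frag]


if __name__ == "__main__":
    read = "ACGTACGTACGT"                           # made-up read
    scores = [10, 10, 10, 10, 2, 2, 2, 2, 9, 9, 9, 9]
    masked = window_mask(read, scores)
    print(masked)
    print(split_fragments(masked))
```

Compared with filtering on the whole-read mean, this keeps the good stretches of a read whose average is dragged down by a short low-quality region, which is the motivation stated in the abstract.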

A Preliminary Investigation of Marine Yeast Biodiversity in New Zealand Waters

10.26686/wgtn.17004976 ◽

2021 ◽

Author(s):

Melissa Francis

Keyword(s):

Sequence Data ◽

Sequence Similarity ◽

Preliminary Investigation ◽

Distribution Patterns ◽

Marine Yeasts ◽

Marine Yeast ◽

Yeast Biomass ◽

Ecological Implications ◽

Terrestrial Environments ◽

Biocontrol Potential

This is the first known investigation of marine yeast biodiversity from waters surrounding New Zealand’s main islands. Marine yeasts were cultured onto agar plates from algae sampled at three locations in the Wellington Region. DNA extractions and PCR amplifications of the internal transcribed spacer (ITS) regions were conducted, and resultant sequence data were used for isolate identification and phylogenetic analysis. Yeasts isolated during this investigation were not unique; seventy-four isolates were identified from a range of genera that are frequently detected in marine and terrestrial environments worldwide. Furthermore, high ITS sequence similarity was observed between yeasts isolated during this investigation and those from geographically distant locations. These findings may indicate that marine yeasts are ubiquitous at a global level, although evidence is insufficient as to whether yeasts also demonstrate biogeographic distribution patterns. Yeasts isolated during this investigation may have ecological implications in New Zealand’s marine environment; marine yeasts are likely to play a general saprophytic role and certain genera are pathogenic. Isolates were also identified from genera that have previously demonstrated beneficial properties and applications, including the production of useful compounds and highly nutritious yeast biomass, biocontrol potential against the postharvest decay of produce, and degradation abilities that may enable bioremediation of polluted marine environments.
