dRep: A tool for fast and accurate genome de-replication that enables tracking of microbial genotypes and improved genome recovery from metagenomes

Mapping Intimacies ◽

10.1101/108142 ◽

2017 ◽

Cited By ~ 3

Author(s):

Matthew R. Olm ◽

Christopher T. Brown ◽

Brandon Brooks ◽

Jillian F. Banfield

Keyword(s):

Time Series ◽

Source Code ◽

Computational Time ◽

Average Nucleotide Identity ◽

Microbial Genomes ◽

Large Genome ◽

Link Type ◽

Assembly Method ◽

Inaccurate Estimation ◽

Genome Distance

The number of microbial genomes sequenced each year is expanding rapidly, in part due to genome-resolved metagenomic studies that routinely recover hundreds of draft-quality genomes. Rapid algorithms have been developed to comprehensively compare large genome sets, but they are not accurate with draft-quality genomes. Here we present dRep, a program that sequentially applies a fast, inaccurate estimation of genome distance and a slow but accurate measure of average nucleotide identity to reduce the computational time for pair-wise genome set comparisons by orders of magnitude. We demonstrate its use in a study where we separately assembled each metagenome from time series datasets. Groups of essentially identical genomes were identified with dRep, and the best genome from each set was selected. This resulted in recovery of significantly more and higher-quality genomes compared to the set recovered using the typical co-assembly method. Documentation is available at http://drep.readthedocs.io/en/master/ and source code is available at https://github.com/MrOlm/drep.
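
The two-stage strategy described in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not dRep's actual implementation: `fast_distance` and `accurate_similarity` are hypothetical stand-ins (toy k-mer Jaccard measures) for the fast estimator and the slow average nucleotide identity computation, and the thresholds and the `quality` scores are arbitrary.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b, k):
    """Jaccard similarity of the k-mer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka and not kb:
        return 1.0
    return len(ka & kb) / len(ka | kb)


def fast_distance(a, b):
    # Hypothetical cheap estimator: short k-mers, coarse resolution.
    return 1.0 - jaccard(a, b, k=4)


def accurate_similarity(a, b):
    # Hypothetical stand-in for a slow, accurate ANI computation.
    return jaccard(a, b, k=8)


def primary_clusters(genomes, max_dist):
    """Single-linkage clustering on the fast distance (union-find)."""
    parent = {name: name for name in genomes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(genomes, 2):
        if fast_distance(genomes[a], genomes[b]) <= max_dist:
            parent[find(a)] = find(b)

    clusters = {}
    for name in genomes:
        clusters.setdefault(find(name), []).append(name)
    return list(clusters.values())


def dereplicate(genomes, quality, max_dist=0.5, min_sim=0.9):
    """Keep the best-quality genome of each group of near-identical genomes."""
    winners = []
    for cluster in primary_clusters(genomes, max_dist):
        # The expensive comparison runs only inside each primary cluster.
        remaining = sorted(cluster, key=lambda n: quality[n], reverse=True)
        while remaining:
            best = remaining.pop(0)
            winners.append(best)
            remaining = [n for n in remaining
                         if accurate_similarity(genomes[best], genomes[n]) < min_sim]
    return sorted(winners)
```

The saving comes from restricting the expensive comparison to pairs that the cheap estimate has already placed in the same primary cluster, so most distant pairs never reach the slow step.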


Struo: a pipeline for building custom databases for common metagenome profilers

10.1101/774372 ◽

2019 ◽

Author(s):

Jacobo de la Cuesta-Zuluaga ◽

Ruth E. Ley ◽

Nicholas D. Youngblut

Keyword(s):

Microbial Diversity ◽

Microbial Communities ◽

Source Code ◽

Functional Information ◽

Microbial Genomes ◽

Link Type ◽

Public Repositories

Summary: Taxonomic and functional information from microbial communities can be efficiently obtained by metagenome profiling, which requires databases of genes and genomes to which sequence reads are mapped. However, the databases that accompany metagenome profilers are not updated at a pace that matches the increase in available microbial genomes. To address this, we developed Struo, a modular pipeline that automates the acquisition of genomes from public repositories and the construction of custom databases for multiple metagenome profilers. The use of custom databases that broadly represent the known microbial diversity by incorporating novel genomes results in a substantial increase in mappability of reads in synthetic and real metagenome datasets. Availability and implementation: Source code available for download at https://github.com/leylabmpi/Struo. Custom GTDB databases available at http://ftp.tue.mpg.de/ebio/projects/struo/.


PyIOmica: Longitudinal Omics Analysis and Classification

10.1101/708941 ◽

2019 ◽

Author(s):

Sergii Domanskyi ◽

Carlo Piermarocchi ◽

George I. Mias

Keyword(s):

Time Series ◽

Gene Ontology ◽

Source Code ◽

Temporal Trends ◽

Enrichment Analysis ◽

Data Normalization ◽

Link Type ◽

Visibility Graphs ◽

Microsoft Windows ◽

Python Package

Summary: PyIOmica is an open-source Python package focusing on integrating longitudinal multiple omics datasets, characterizing, and classifying temporal trends. The package includes multiple bioinformatics tools, including data normalization, annotation, classification, visualization, and enrichment analysis for gene ontology terms and pathways. Additionally, the package includes an implementation of visibility graphs to visualize time series as networks. Availability and implementation: PyIOmica is implemented as a Python package (pyiomica), available for download and installation through the Python Package Index (PyPI) (https://pypi.python.org/pypi/pyiomica), and can be deployed using the Python import function following installation. PyIOmica has been tested on Mac OS X, Unix/Linux and Microsoft Windows. The application is distributed under an MIT license. Source code for each release is also available for download on Zenodo (https://doi.org/10.5281/zenodo.3342612).


Genome analysis reveals that the correct name of type strain Adlercreutzia caecicola DSM 22242T is Parvibacter caecicola Clavel et al. 2013

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijsem.0.004814 ◽

2021 ◽

Vol 71 (5) ◽

Author(s):

Dominic A. Stoll ◽

Nicolas Danylec ◽

Christina Grimmler ◽

Sabine E. Kulling ◽

Melanie Huch

Keyword(s):

Type Strain ◽

Type Species ◽

Draft Genome ◽

Amino Acid Identity ◽

Average Nucleotide Identity ◽

Content Type ◽

Link Type ◽

Genome Wide ◽

Average Amino Acid Identity ◽

Biochemical Analyses

The strain Adlercreutzia caecicola DSM 22242T (=CCUG 57646T=NR06T) was taxonomically described in 2013 and named as Parvibacter caecicola Clavel et al. 2013. In 2018, the name of the strain DSM 22242T was changed to Adlercreutzia caecicola (Clavel et al. 2013) Nouioui et al. 2018 due to taxonomic investigations of the closely related genera Adlercreutzia, Asaccharobacter and Enterorhabdus within the phylum Actinobacteria. However, the first whole draft genome of strain DSM 22242T was published by our group in 2019. Therefore, the genome was not available within the study of Nouioui et al. (2018). The results of the polyphasic approach within this study, including phenotypic and biochemical analyses and genome-based taxonomic investigations [genome-wide average nucleotide identity (gANI), alignment fraction (AF), average amino acid identity (AAI), percentage of orthologous conserved proteins (POCP) and genome blast distance phylogeny (GBDP) tree], indicated that the proposed change of the name Parvibacter caecicola to Adlercreutzia caecicola was not correct. Therefore, it is proposed that the correct name of Adlercreutzia caecicola (Clavel et al. 2013) Nouioui et al. 2018 strain DSM 22242T is Parvibacter caecicola Clavel et al. 2013.


Novel Application to Recognize a Breakdown Pressure Event on Time Series Frac Data Vs. an Artificial Intelligence Approach

10.2118/200846-ms ◽

2021 ◽

Author(s):

Alberto Jose Ramirez ◽

Jessica Graciela Iriarte

Keyword(s):

Neural Network ◽

Time Series ◽

Service Providers ◽

Complex Model ◽

Rate Of Change ◽

Computational Time ◽

Breakdown Pressure ◽

Constant Multiple ◽

Artificial Neural ◽

Rule Based Approach

Breakdown pressure is the peak pressure attained when fluid is injected into a borehole until fracturing occurs. Hydraulic fracturing operations are conducted above the breakdown pressure, at which the rock formation fractures and allows fluids to flow inside. This value is essential to obtain formation stress measurements. The objective of this study is to automate the selection of breakdown pressure flags on time series fracture data using a novel algorithm in lieu of an artificial neural network. This study is based on high-frequency treatment data collected from cloud-based software. The comma-separated (.csv) files include treating pressure (TP), slurry rate (SR), and bottomhole proppant concentration (BHPC) with defined start and end time flags. Using feature engineering, the model calculates the rate of change of treating pressure (dtp_1st), slurry rate (dsr_1st), and bottomhole proppant concentration (dbhpc_1st). An algorithm isolates the initial area of the treatment plot, before proppant reaches the perforations, where the slurry rate is constant and the pressure increases. The first approach uses a neural network trained with 872 stages to isolate the breakdown pressure area. The expert rule-based approach finds the highest pressure spikes where SR is constant. Then, a refining function finds the maximum treating pressure value and returns its job time as the predicted breakdown pressure flag. Due to the complexity of unconventional reservoirs, the treatment plots may show pressure changes while the slurry rate is constant multiple times during the same stage. The diverse behavior of the breakdown pressure inhibits an artificial neural network's ability to find one "consistent pattern" across the stage. The multiple patterns found through the stage make it difficult to select an area to find the breakdown pressure value. Testing this complex model worked moderately well, but it made the computational time too high for deployment.
On the other hand, the automation algorithm uses rules to find the breakdown pressure value with its location within the stage. The breakdown flag model was validated with 102 stages and tested with 775 stages, returning the location and values corresponding to the highest pressure point. Results show that 86% of the predicted breakdown pressures are within 65 psi of manually picked values. Breakdown pressure recognition automation is important because it saves time and allows engineers to focus on analytical tasks instead of repetitive data-structuring tasks. Automating this process brings consistency to the data across service providers and basins. In some cases, due to its ability to zoom in, the algorithm recognized breakdown pressures with higher accuracy than subject matter experts. Comparing the results from two different approaches allowed us to conclude that similar or better results with lower running times can be achieved without using complex algorithms.
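
The rule-based approach described in this abstract might look something like the sketch below. It is an illustration only, not the authors' algorithm: the function names, the tolerance `sr_tol`, the refinement `window`, and the use of simple first differences for the rate-of-change features are all assumptions.

```python
def first_difference(values):
    """Per-sample rate of change; the first element is 0.0 by convention."""
    return [0.0] + [values[i] - values[i - 1] for i in range(1, len(values))]


def find_breakdown(time, tp, sr, bhpc, sr_tol=0.5, window=2):
    """Return (job_time, pressure) of the predicted breakdown, or None."""
    dtp = first_difference(tp)  # plays the role of dtp_1st
    dsr = first_difference(sr)  # plays the role of dsr_1st

    # Candidate samples: pumping, rate roughly constant, no proppant yet.
    candidates = [i for i in range(len(tp))
                  if sr[i] > 0 and abs(dsr[i]) <= sr_tol and bhpc[i] == 0]
    if not candidates:
        return None

    # Rule: take the largest pressure spike among the candidates.
    spike = max(candidates, key=lambda i: dtp[i])

    # Refinement: maximum treating pressure in a small window around the spike.
    lo = max(0, spike - window)
    hi = min(len(tp), spike + window + 1)
    peak = max(range(lo, hi), key=lambda i: tp[i])
    return time[peak], tp[peak]
```

Because every step is a single pass over the samples, a rule set of this kind runs in linear time per stage, which is consistent with the lower running times the abstract reports for the rule-based approach.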


Limosilactobacillus urinaemulieris sp. nov. and Limosilactobacillus portuensis sp. nov. isolated from urine of healthy women

INTERNATIONAL JOURNAL OF SYSTEMATIC AND EVOLUTIONARY MICROBIOLOGY ◽

10.1099/ijsem.0.004726 ◽

2019 ◽

Vol 71 (3) ◽

Cited By ~ 7

Author(s):

Magdalena Ksiezarek ◽

Teresa Gonçalves Ribeiro ◽

Joana Rocha ◽

Filipa Grosso ◽

Svetlana Ugarcina Perovic ◽

...

Keyword(s):

Type Species ◽

Novel Species ◽

Rrna Gene ◽

Gram Stain ◽

Gene Sequences ◽

16S Rrna Gene Sequences ◽

Content Type ◽

Link Type ◽

Healthy Women ◽

Genome Distance

Two Gram-stain-positive strains, c9Ua_26_MT and c11Ua_112_MT, were isolated from voided urine samples from two healthy women. Comparative 16S rRNA gene sequences demonstrated that these novel strains were members of the genus Limosilactobacillus. Phylogenetic analysis based on pheS gene sequences and core genomes showed that each strain formed a separate branch and was closest to Limosilactobacillus vaginalis DSM 5837T. The average nucleotide identity (ANI) and Genome-to-Genome Distance Calculator (GGDC) values between c9Ua_26_MT and the closest relative DSM 5837T were 90.7 and 42.9 %, respectively. The ANI and GGDC values between c11Ua_112_MT and the closest relative DSM 5837T were 91.2 and 45.0 %, and those among the strains were 92.9 and 51.0 %, respectively. The major fatty acids were C12 : 0 (40.2 %), C16 : 0 (26.7 %) and C18 : 1 ω9c (17.7 %) for strain c9Ua_26_MT, and C18 : 1 ω9c (38.0 %), C16 : 0 (33.3 %) and C12 : 0 (17.6 %) for strain c11Ua_112_MT. The genomic DNA G+C content of strains c9Ua_26_MT and c11Ua_112_MT was 39.9 and 39.7 mol%, respectively. On the basis of the data presented here, strains c9Ua_26_MT and c11Ua_112_MT represent two novel species of the genus Limosilactobacillus, for which the names Limosilactobacillus urinaemulieris sp. nov. (c9Ua_26_MT=CECT 30144T=LMG 31899T) and Limosilactobacillus portuensis sp. nov. (c11Ua_112_MT=CECT 30145T=LMG 31898T) are proposed.


Multi-time-scale input approaches for hourly-scale rainfall–runoff modeling based on recurrent neural networks

Journal of Hydroinformatics ◽

10.2166/hydro.2021.095 ◽

2021 ◽

Author(s):

Kei Ishida ◽

Masato Kiyama ◽

Ali Ercan ◽

Motoki Amagasaki ◽

Tongbi Tu

Keyword(s):

Time Series ◽

Time Scale ◽

Time Series Data ◽

Series Data ◽

Computational Time ◽

Rainfall Runoff ◽

Runoff Modeling ◽

Input Time ◽

Target Data ◽

Resolution Data

This study proposes two effective approaches to reduce the required computational time of the training process for time-series modeling through a recurrent neural network (RNN) using multi-time-scale time-series data as input. One approach provides coarse and fine temporal resolutions of the input time-series data to the RNN in parallel. The other concatenates the coarse and fine temporal resolutions of the input time-series data over time before considering them as the input to the RNN. In both approaches, first, the finer temporal resolution data are utilized to learn the fine temporal scale behavior of the target data. Then, coarser temporal resolution data are expected to capture long-duration dependencies between the input and target variables. The proposed approaches were implemented for hourly rainfall–runoff modeling at a snow-dominated watershed by employing a long short-term memory network, which is a type of RNN. Subsequently, the daily and hourly meteorological data were utilized as the input, and hourly flow discharge was considered as the target data. The results confirm that both of the proposed approaches can significantly reduce the computational time required for training the RNN. Lastly, one of the proposed approaches improves the estimation accuracy considerably in addition to computational efficiency.
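
A rough sketch of how the two input layouts could be built from an hourly series is given below. It covers only the data preparation, not the network, and rests on assumptions not stated in the abstract: daily values are taken as means of 24 hourly values, and the function names and window lengths (`fine_hours`, `coarse_days`) are hypothetical.

```python
def daily_means(hourly, hours_per_day=24):
    """Aggregate an hourly series into daily means (complete days only)."""
    n_days = len(hourly) // hours_per_day
    return [sum(hourly[d * hours_per_day:(d + 1) * hours_per_day]) / hours_per_day
            for d in range(n_days)]


def concatenated_input(hourly, fine_hours, coarse_days, hours_per_day=24):
    """Coarse steps for the older history, followed by fine steps for the recent past."""
    if fine_hours <= 0 or coarse_days <= 0:
        raise ValueError("window lengths must be positive")
    needed = coarse_days * hours_per_day + fine_hours
    if len(hourly) < needed:
        raise ValueError("not enough history")
    recent = hourly[-fine_hours:]
    older = hourly[-needed:-fine_hours]
    return daily_means(older, hours_per_day) + recent


def parallel_inputs(hourly, fine_hours, coarse_days, hours_per_day=24):
    """Two separate sequences, one per resolution, for two parallel RNN branches."""
    if fine_hours <= 0 or coarse_days <= 0:
        raise ValueError("window lengths must be positive")
    if len(hourly) < max(fine_hours, coarse_days * hours_per_day):
        raise ValueError("not enough history")
    coarse_span = hourly[-coarse_days * hours_per_day:]
    return daily_means(coarse_span, hours_per_day), hourly[-fine_hours:]
```

With two coarse days and six fine hours, the concatenated sequence has 8 steps instead of the 54 hourly steps covering the same span, which illustrates why a shorter input sequence can reduce training time.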


A Bayesian neural network predicts the dissolution of compact planetary systems

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.2026053118 ◽

2021 ◽

Vol 118 (40) ◽

pp. e2026053118

Author(s):

Miles Cranmer ◽

Daniel Tamayo ◽

Hanno Rein ◽

Peter Battaglia ◽

Samuel Hadden ◽

...

Keyword(s):

Neural Network ◽

Machine Learning ◽

Time Series ◽

Planetary System ◽

Machine Learning Algorithms ◽

Orbital Elements ◽

Bayesian Neural Network ◽

Inference Model ◽

Link Type ◽

Numerical Integrator

We introduce a Bayesian neural network model that can accurately predict not only if, but also when a compact planetary system with three or more planets will go unstable. Our model, trained directly from short N-body time series of raw orbital elements, is more than two orders of magnitude more accurate at predicting instability times than analytical estimators, while also reducing the bias of existing machine learning algorithms by nearly a factor of three. Despite being trained on compact resonant and near-resonant three-planet configurations, the model demonstrates robust generalization to both nonresonant and higher multiplicity configurations, in the latter case outperforming models fit to that specific set of integrations. The model computes instability estimates up to 10^5 times faster than a numerical integrator, and unlike previous efforts provides confidence intervals on its predictions. Our inference model is publicly available in the SPOCK (https://github.com/dtamayo/spock) package, with training code open sourced (https://github.com/MilesCranmer/bnn_chaos_model).


idCOV: a pipeline for quick clade identification of SARS-CoV-2 isolates

10.1101/2020.10.08.330456 ◽

2020 ◽

Author(s):

Xun Zhu ◽

Ti-Cheng Chang ◽

Richard Webby ◽

Gang Wu

Keyword(s):

Personal Computer ◽

Source Code ◽

Command Line ◽

Sequencing Data ◽

Link Type ◽

Public Dataset ◽

Virus Isolates

idCOV is a phylogenetic pipeline for quickly identifying the clades of SARS-CoV-2 virus isolates from raw sequencing data based on a selected clade-defining marker list. Using a public dataset, we show that idCOV can make calls equivalent to those annotated by Nextstrain.org on all three common clade systems, directly from user-uploaded FastQ files. Web and equivalent command-line interfaces are available. It can be deployed on any Linux environment, including personal computer, HPC and the cloud. The source code is available at https://github.com/xz-stjude/idcov. Installation documentation can be found at https://github.com/xz-stjude/idcov/blob/master/README.md.
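
Clade calling from a clade-defining marker list can be illustrated with the toy sketch below. It does not reproduce idCOV's pipeline: the function name, the marker positions and alleles, and the clade labels are all hypothetical, and the rule of preferring the clade with the most matching markers is an assumption.

```python
def call_clade(genotype, clade_markers):
    """Return the clade whose markers all match, preferring the most specific one.

    genotype maps genome position -> observed allele;
    clade_markers maps clade name -> {position: defining allele}.
    """
    best, best_count = None, 0
    for clade, markers in clade_markers.items():
        if all(genotype.get(pos) == allele for pos, allele in markers.items()):
            # A clade defined by more markers is treated as more specific.
            if len(markers) > best_count:
                best, best_count = clade, len(markers)
    return best
```

A genotype that matches none of the marker sets yields `None`, which leaves the isolate unassigned rather than forcing a call.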


CAMITAX: Taxon labels for microbial genomes

10.1101/532473 ◽

2019 ◽

Author(s):

Andreas Bremges ◽

Adrian Fritz ◽

Alice C. McHardy

Keyword(s):

Single Cells ◽

Ensemble Classification ◽

Rrna Gene ◽

Genome Sequences ◽

Microbial Genomes ◽

Phylogenetic Placement ◽

Gene Homology ◽

Reference Databases ◽

Taxonomic Assignments ◽

Genome Distance

The number of microbial genome sequences is growing exponentially, also thanks to recent advances in recovering complete or near-complete genomes from metagenomes and single cells. Assigning reliable taxon labels to genomes is key and often a prerequisite for downstream analyses. We introduce CAMITAX, a scalable and reproducible workflow for the taxonomic labelling of microbial genomes recovered from isolates, single cells, and metagenomes. CAMITAX combines genome distance-, 16S rRNA gene-, and gene homology-based taxonomic assignments with phylogenetic placement. It uses Nextflow to orchestrate reference databases and software containers, and thus combines ease of installation and use with computational reproducibility. We evaluated the method on several hundred metagenome-assembled genomes with high-quality taxonomic annotations from the TARA Oceans project, and show that the ensemble classification method in CAMITAX improved on all individual methods across tested ranks. While we initially developed CAMITAX to aid the Critical Assessment of Metagenome Interpretation (CAMI) initiative, it evolved into a comprehensive software to reliably assign taxon labels to microbial genomes. CAMITAX is available under the Apache License 2.0 at: https://github.com/CAMI-challenge/CAMITAX


GalaxyCloudRunner: enhancing scalable computing for Galaxy

10.1101/2020.05.28.121772 ◽

2020 ◽

Author(s):

N Goonasekera ◽

A Mahmoud ◽

J Chilton ◽

E Afgan

Keyword(s):

Source Code ◽

Supplementary Information ◽

Scalable Computing ◽

Link Type ◽

Cloud Providers ◽

Galaxy Server ◽

Cloud Resources

Summary: The existence of more than 100 public Galaxy servers with service quotas is indicative of the need for an increased availability of compute resources for Galaxy to use. The GalaxyCloudRunner enables a Galaxy server to easily expand its available compute capacity by sending user jobs to cloud resources. User jobs are routed to the acquired resources based on a set of configurable rules and the resources can be dynamically acquired from any of 4 popular cloud providers (AWS, Azure, GCP, or OpenStack) in an automated fashion. Availability and implementation: GalaxyCloudRunner is implemented in Python and leverages Docker containers. The source code is MIT licensed and available at https://github.com/cloudve/galaxycloudrunner. The documentation is available at http://gcr.cloudve.org/. Contact: Enis Afgan. Supplementary information: None.
