Computational Framework for High-Quality Production and Large-Scale Evolutionary Analysis of Metagenome Assembled Genomes

2019
Vol 37 (2)
pp. 593-598
Author(s):  
Boštjan Murovec ◽  
Leon Deutsch ◽  
Blaz Stres

Abstract Microbial species play important roles in different environments, and the production of high-quality genomes from metagenome data sets represents a major obstacle to understanding their ecological and evolutionary dynamics. Metagenome-Assembled Genomes Orchestra (MAGO) is a computational framework that integrates and simplifies metagenome assembly, binning, bin improvement, bin quality assessment (completeness and contamination), bin annotation, and evolutionary placement of bins via detailed maximum-likelihood phylogeny based on multiple marker genes under different amino acid substitution models, alongside average nucleotide identity (ANI) analysis of genomes for delineating species boundaries and operational taxonomic units. MAGO offers streamlined execution of the entire metagenomics pipeline, error checking, computational resource distribution, and compatibility of data formats, governed by user-tailored pipeline processing. MAGO is an open-source software package released in three forms: a Singularity image and a Docker container, suited both to HPC use and to running MAGO on commodity hardware, and a virtual machine giving full access to MAGO's underlying structure and source code. MAGO is open to suggestions for extensions and is amenable to use in both research and teaching of genomics and molecular evolution of genomes assembled from small single-cell projects or large-scale, complex environmental metagenomes.
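To make the ANI idea concrete, here is a toy k-mer-based genome similarity in the same spirit. This is an illustrative assumption, not MAGO's actual method: real ANI tools (e.g., fastANI) align genome fragments rather than comparing k-mer sets.

```python
# Illustrative only: a crude k-mer Jaccard similarity as a stand-in for ANI.
# Real pipelines align genome fragments; this sketch is NOT MAGO's implementation.

def kmers(seq, k=4):
    """Return the set of k-mers in a nucleotide sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_similarity(genome_a, genome_b, k=4):
    """Jaccard similarity of the two k-mer sets."""
    a, b = kmers(genome_a, k), kmers(genome_b, k)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

# Two toy "genomes"; in practice a ~95% ANI threshold is commonly used
# to delineate bacterial species boundaries.
g1 = "ATGCGTACGTTAGCATGCGTACGTTAGC"
g2 = "ATGCGTACGTTAGCATGCGTACGTTAGC"
same = kmer_similarity(g1, g2)   # identical sequences
diff = kmer_similarity(g1, "TTTTAAAACCCCGGGGTTTTAAAACCCC")
```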

2019
Author(s):  
Roberto Lozano ◽  
Elodie Gazave ◽  
Jhonathan P.R. dos Santos ◽  
Markus Stetter ◽  
Ravi Valluru ◽  
...  

Abstract Sorghum and maize share a close evolutionary history that can be explored through comparative genomics. To perform a large-scale comparison of genomic variation between the two species, we analyzed 13 million variants identified from whole-genome resequencing of 468 sorghum lines together with 25 million variants previously identified in 1,218 maize lines. Deleterious mutations in both species were prevalent in pericentromeric regions, enriched in non-syntenic genes, and present at low allele frequencies. A comparison of deleterious burden between sorghum and maize revealed that sorghum, in contrast to maize, departs from the "domestication cost" hypothesis, which predicts a higher deleterious burden among domesticates than among wild lines. Additionally, sorghum and maize population genetic summary statistics were used to predict a gene deleterious index with an accuracy higher than 0.5. This work represents a key step toward understanding the evolutionary dynamics of deleterious variants in sorghum and provides a comparative genomics framework for prioritizing them for removal through genome editing and breeding.
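A deleterious burden comparison of this kind reduces, in its simplest additive form, to summing predicted-deleterious allele counts per line. The sketch below is a hypothetical minimal version under that additive assumption, not the paper's exact pipeline, and the data are invented.

```python
# Hypothetical sketch: per-line deleterious burden from a genotype matrix,
# under a simple additive model (not the paper's actual pipeline).

def deleterious_burden(genotypes, is_deleterious):
    """genotypes: dict line -> list of alt-allele counts (0, 1, or 2) per site.
    is_deleterious: booleans marking sites whose alt allele is predicted deleterious.
    Returns dict line -> summed count of deleterious alleles."""
    return {
        line: sum(g for g, bad in zip(sites, is_deleterious) if bad)
        for line, sites in genotypes.items()
    }

# Toy data: the "domestication cost" hypothesis would predict the
# domesticated line carries the higher burden.
genos = {"wild": [0, 1, 2, 0], "domesticated": [2, 2, 1, 0]}
flags = [True, False, True, True]
burden = deleterious_burden(genos, flags)
```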


mSystems
2020
Vol 5 (6)
Author(s):  
Nicholas D. Youngblut ◽  
Jacobo de la Cuesta-Zuluaga ◽  
Georg H. Reischer ◽  
Silke Dauser ◽  
Nathalie Schuster ◽  
...  

ABSTRACT Large-scale metagenome assemblies of human microbiomes have produced a vast catalogue of previously unseen microbial genomes; however, comparatively few microbial genomes derive from other vertebrates. Here, we generated 5,596 metagenome-assembled genomes (MAGs) from the gut metagenomes of 180 predominantly wild animal species representing 5 classes, in addition to 14 existing animal gut metagenome data sets. The MAGs comprised 1,522 species-level genome bins (SGBs), most of which were novel at the species, genus, or family level, and the majority were enriched in host versus environment metagenomes. Many traits distinguished SGBs enriched in host or environmental biomes, including the number of antimicrobial resistance genes. We identified 1,986 diverse biosynthetic gene clusters; only 23 clustered with any MIBiG database references. Gene-based assembly revealed tremendous gene diversity, much of it host or environment specific. Our MAG and gene data sets greatly expand the microbial genome repertoire and provide a broad view of microbial adaptations to the vertebrate gut.
IMPORTANCE Microbiome studies on a select few mammalian species (e.g., humans, mice, and cattle) have revealed a great deal of novel genomic diversity in the gut microbiome. However, little is known of the microbial diversity in the gut of other vertebrates. We studied the gut microbiomes of a large set of mostly wild animal species consisting of mammals, birds, reptiles, amphibians, and fish. We found that existing reference databases commonly used for metagenomic analyses failed to capture the microbiome diversity among vertebrates. To increase database representation, we applied advanced metagenome assembly methods to our animal gut data and to many public gut metagenome data sets that had not been used to obtain microbial genomes. Our resulting genome and gene cluster collections comprised a great deal of novel taxonomic and genomic diversity, which we extensively characterized. Our findings substantially expand what is known of microbial genomic diversity in the vertebrate gut.


2019
Vol 38 (11)
pp. 865-871
Author(s):  
Jean-Paul van Gestel ◽  
Ken Hartman ◽  
Corey Joy ◽  
Qingsong Li ◽  
Michael Pfister ◽  
...  

From 2015 through 2018, BP acquired six large-scale 3D vertical seismic profile (VSP) data sets at its Gulf of Mexico assets, two at each of the Thunder Horse, Mad Dog, and Atlantis fields. Acquisition of these large-scale data sets was enabled by the development of a 100-level wireline tool and the adoption of simultaneous shooting. With these two developments, it became feasible to acquire data sets with the coverage and data density needed to build high-quality images of the subsurface from 3D VSP acquisitions. Recent advances in finite-difference modeling have guided the survey design and the high-quality processing required to create the 3D VSP image volumes. These volumes have two main advantages over conventional surface seismic data. First, in 3D VSP acquisition the receivers can be located below overlying salt bodies, allowing illumination of reservoirs that cannot be achieved with surface seismic data. Second, placing the receivers closer to the imaging targets yields higher frequency content in the resulting VSP data than in conventional surface seismic images. Both imaging enhancements can have significant business value, and the resulting VSP data sets have demonstrated a clear impact on business decisions. In three case studies, we demonstrate the business impact of the acquired 3D VSP data through improved imaging of stratigraphic edges, improved interpretation of fault geometry and orientation, and the related improvement in the quality of well planning and targeting. We conclude with a discussion of cost and global impact and present recommendations and lessons learned for future surveys.


2016
Vol 4 (4)
pp. 508-530
Author(s):  
CHRISTIAN L. STAUDT ◽  
ALEKSEJS SAZONOVS ◽  
HENNING MEYERHENKE

Abstract We introduce NetworKit, an open-source software package for analyzing the structure of large complex networks. Appropriate algorithmic solutions are required to handle increasingly common graph data sets containing up to billions of connections. We describe the methodology applied to develop scalable solutions to network analysis problems, including techniques such as parallelization, heuristics for computationally expensive problems, efficient data structures, and a modular software architecture. Our goal for the software is to package the results of our algorithm engineering efforts and put them into the hands of domain experts. NetworKit is implemented as a hybrid that combines performance-critical kernels written in C++ with a Python frontend, enabling integration into the Python ecosystem of tested tools for data analysis and scientific computing. The package provides a wide range of functionality (including common and novel analytics algorithms and graph generators) through a convenient interface. In an experimental comparison with related software, NetworKit shows the best performance on a range of typical analysis tasks.
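One "typical analysis task" of the kind NetworKit accelerates is finding connected components and degree statistics. The pure-Python stand-in below only illustrates the task itself; in NetworKit the equivalent work runs in a parallel C++ kernel behind a Python call (e.g., its `networkit.components.ConnectedComponents` class), and the graph representation here is an assumption for the sketch.

```python
# Pure-Python illustration of a typical network-analysis kernel:
# largest connected component (BFS) and degree distribution on an
# adjacency-list graph. NOT NetworKit's API; just the task it speeds up.
from collections import deque

def largest_component(adj):
    """Return the size of the largest connected component via BFS."""
    seen, best = set(), 0
    for start in adj:
        if start in seen:
            continue
        queue, size = deque([start]), 0
        seen.add(start)
        while queue:
            u = queue.popleft()
            size += 1
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        best = max(best, size)
    return best

# A 4-node path (0-1-2-3) plus a separate 2-node edge (4-5).
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
degrees = {u: len(vs) for u, vs in adj.items()}
```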


2021
Vol 7 (1)
Author(s):  
Mohammadreza Yaghoobi ◽  
Krzysztof S. Stopka ◽  
Aaditya Lakshmanan ◽  
Veera Sundararaghavan ◽  
John E. Allison ◽  
...  

Abstract The PRISMS-Fatigue open-source framework for simulation-based analysis of microstructural influences on fatigue resistance for polycrystalline metals and alloys is presented here. The framework uses the crystal plasticity finite element method as its microstructure analysis tool and provides a highly efficient, scalable, flexible, and easy-to-use ICME community platform. The PRISMS-Fatigue framework is linked to different open-source software to instantiate microstructures, compute the material response, and assess fatigue indicator parameters. The performance of PRISMS-Fatigue is benchmarked against a similar framework implemented using ABAQUS. Results indicate that the multilevel parallelism scheme of PRISMS-Fatigue is more efficient and scalable than ABAQUS for large-scale fatigue simulations. The performance and flexibility of this framework are demonstrated with various examples that assess the driving force for fatigue crack formation in microstructures with different crystallographic textures, grain morphologies, and grain numbers, under different multiaxial strain states, strain magnitudes, and boundary conditions.


Author(s):  
Lior Shamir

Abstract Several recent studies using large data sets of galaxies have shown a non-random distribution of the spin directions of spiral galaxies, even when the galaxies are too far from each other to interact gravitationally. Here, a data set of $\sim8.7\cdot10^3$ spiral galaxies imaged by the Hubble Space Telescope (HST) is used to test and profile a possible asymmetry between galaxy spin directions. The asymmetry between galaxies with opposite spin directions is compared to that of galaxies from the Sloan Digital Sky Survey (SDSS). The two data sets contain different galaxies at different redshift ranges, and each was annotated using a different method. Both data sets show a similar asymmetry in the COSMOS field, which is covered by both telescopes. Fitting the asymmetry of the galaxies to cosine dependence yields a dipole axis with significance of $\sim2.8\sigma$ and $\sim7.38\sigma$ in HST and SDSS, respectively. The most likely dipole axis identified in the HST galaxies is at $(\alpha=78^{\rm o},\delta=47^{\rm o})$, well within the $1\sigma$ error range of the most likely dipole axis in the SDSS galaxies with $z>0.15$, identified at $(\alpha=71^{\rm o},\delta=61^{\rm o})$.
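The "fitting to cosine dependence" step can be sketched as a least-squares fit of each galaxy's spin sign against the cosine of its angular distance from a candidate dipole axis. The form of the fit and the synthetic data below are illustrative assumptions, not the paper's exact statistical procedure.

```python
# Minimal sketch, assuming a simple least-squares dipole fit:
# spin_i ~ A * cos(angle between galaxy i and the candidate axis).
import math

def cos_angle(ra1, dec1, ra2, dec2):
    """Cosine of the angular separation between two sky positions (radians)."""
    return (math.sin(dec1) * math.sin(dec2)
            + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))

def fit_dipole_amplitude(galaxies, axis_ra, axis_dec):
    """Least-squares amplitude A for spin ~ A * cos(angle to axis).
    galaxies: list of (ra, dec, spin) tuples with spin in {+1, -1}."""
    num = den = 0.0
    for ra, dec, spin in galaxies:
        d = cos_angle(ra, dec, axis_ra, axis_dec)
        num += spin * d
        den += d * d
    return num / den if den else 0.0

# Synthetic galaxies whose spin signs follow the axis exactly,
# so the fitted amplitude comes out positive.
axis = (math.radians(78), math.radians(47))
gals = [(math.radians(ra), math.radians(dec),
         1 if cos_angle(math.radians(ra), math.radians(dec), *axis) > 0 else -1)
        for ra, dec in [(10, 20), (80, 50), (200, -30), (300, 10)]]
A = fit_dipole_amplitude(gals, *axis)
```

In a full analysis, the fit would be repeated over a grid of candidate axes to locate the most likely dipole direction and its significance.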


2021
Vol 15 (5)
pp. 1-52
Author(s):  
Lorenzo De Stefani ◽  
Erisa Terolli ◽  
Eli Upfal

We introduce Tiered Sampling, a novel technique for estimating the count of sparse motifs in massive graphs whose edges are observed in a stream. Our technique requires only a single pass over the data and uses a memory of fixed size M, which can be orders of magnitude smaller than the number of edges. Our methods address the challenging task of counting sparse motifs (sub-graph patterns) that have a low probability of appearing in a sample of M edges, the maximum amount of data available to the algorithms at each step. To obtain an unbiased, low-variance estimate of the count, we partition the available memory into tiers (layers) of reservoir samples. While the base tier is a standard reservoir sample of edges, the other tiers are reservoir samples of sub-structures of the desired motif. By storing the more frequent sub-structures of the motif, we increase the probability of detecting an occurrence of the sparse motif being counted, thus decreasing the variance and error of the estimate. While we focus on the design and analysis of algorithms for counting 4-cliques, we present a method that generalizes Tiered Sampling to obtain high-quality estimates of the number of occurrences of any sub-graph of interest, while reducing the analysis effort by exploiting specific properties of the pattern of interest. We present a complete analytical treatment and an extensive experimental evaluation of our proposed method using both synthetic and real-world data. Our results demonstrate the advantage of our method in obtaining high-quality approximations of the number of 4- and 5-cliques in large graphs using a very limited amount of memory, significantly outperforming the single-edge-sample approach for counting sparse motifs in large-scale graphs.
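The base tier described above is a standard reservoir sample of the edge stream, which can be sketched as follows. The upper tiers, which reservoir-sample sub-structures of the target motif (e.g., partial cliques), are omitted here; this is only an illustration of the base layer, not the authors' full algorithm.

```python
# Sketch of the base tier of Tiered Sampling: a uniform reservoir sample
# of edges maintained in one pass over a stream, using fixed memory m.
import random

def reservoir_sample(stream, m, seed=0):
    """Keep a uniform random sample of at most m items from a single pass."""
    rng = random.Random(seed)
    sample = []
    for t, item in enumerate(stream, start=1):
        if len(sample) < m:
            sample.append(item)       # fill the reservoir first
        else:
            # Item t replaces a stored item with probability m / t,
            # which keeps every item equally likely to be in the sample.
            j = rng.randrange(t)
            if j < m:
                sample[j] = item
    return sample

edges = [(i, i + 1) for i in range(1000)]  # toy edge stream
sample = reservoir_sample(edges, m=50)
```

In the full technique, an incoming edge is also checked against the stored sub-structures in the upper tiers, so that occurrences of the sparse motif are detected far more often than a single edge sample would allow.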


2020
Vol 21 (S18)
Author(s):  
Sudipta Acharya ◽  
Laizhong Cui ◽  
Yi Pan

Abstract
Background: In recent years, the use of multiple genomic and proteomic sources to investigate challenging bioinformatics problems has become immensely popular among researchers. One such problem is feature (gene) selection: identifying relevant and non-redundant marker genes from high-dimensional gene expression data sets. In that context, an efficient feature selection algorithm that exploits knowledge from multiple potential biological resources may be an effective way to understand the spectrum of cancer or other diseases, with applications in epidemiology for a particular population.
Results: In the current article, we formulate feature selection and marker gene detection as a multi-view multi-objective clustering problem. To this end, we propose an Unsupervised Multi-View Multi-Objective clustering-based gene selection approach called UMVMO-select. Three important biological data resources (gene ontology, protein interaction data, and protein sequences), along with gene expression values, are collectively used to design two different views. UMVMO-select aims to reduce the gene space without (or only minimally) compromising sample classification efficiency, and it determines relevant, non-redundant gene markers from three benchmark cancer gene expression data sets.
Conclusion: A thorough comparative analysis was performed against five clustering and nine existing feature selection methods with respect to several internal and external validity metrics. The results show the superiority of the proposed method and are further validated through a biological significance test and heatmap plotting.

