Clusterflock: A Flocking Algorithm for Isolating Congruent Phylogenomic Datasets

Mapping Intimacies ◽

10.1101/045773 ◽

2016 ◽

Author(s):

Apurva Narechania ◽

Richard Baker ◽

Rob DeSalle ◽

Barun Mathema ◽

Sergios-Orestis Kolokotronis ◽

...

Keyword(s):

Missing Data ◽

Large Scale ◽

DNA Microarrays ◽

Phylogenetic Signal ◽

Orthologous Gene ◽

Gene Families ◽

Clustering Problem ◽

A Genome ◽

Simple Rules ◽

Evolutionary Context

Abstract Background Collective animal behavior such as the flocking of birds or the shoaling of fish has inspired a class of algorithms designed to optimize distance-based clusters in various applications including document analysis and DNA microarrays. In the flocking model, individual agents respond only to their immediate environment and move according to a few simple rules. After several iterations the agents self-organize and clusters emerge without the need for partitional seeds. In addition to its unsupervised nature, flocking offers several computational advantages including the potential to decrease the number of required comparisons. Findings In Clusterflock, we implement a flocking algorithm designed to find groups (flocks) of orthologous gene families (OGFs) that share a common evolutionary history. Pairwise distances that measure the phylogenetic incongruence between OGFs guide flock formation. We test this approach on several simulated datasets varying the number of underlying topologies, the proportion of missing data, and evolutionary rates, and show that in datasets containing high levels of missing data and rate heterogeneity, Clusterflock outperforms other well-established clustering techniques. We also demonstrate its utility on a known, large-scale recombination event in Staphylococcus aureus. By isolating sets of OGFs with divergent phylogenetic signal, we can pinpoint the recombined region without forcing a pre-determined number of groupings or defining a pre-determined incongruence threshold. Conclusions Clusterflock is an open source tool that can be used to discover horizontally transferred genes, recombining areas of chromosomes, and the phylogenetic “core” of a genome. Though we use it in an evolutionary context, it is generalizable to any clustering problem. Users can write extensions to calculate any distance metric on the unit interval and use these distances to flock any type of data.
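
The abstract notes that flocking is driven by pairwise distances on the unit interval. As a minimal sketch of what such a distance could look like, the Python below computes a normalized Robinson-Foulds distance between gene trees. This is an illustrative assumption, not the incongruence measure Clusterflock itself uses; the function names, the `trees` dictionary, and the representation of each tree as a set of bipartitions (each split stored as the side that excludes taxon E) are all hypothetical.

```python
from itertools import combinations


def normalized_rf(splits_a, splits_b):
    """Normalized Robinson-Foulds distance between two trees.

    Each tree is given as a collection of non-trivial bipartitions.
    The result lies in [0, 1]: 0 for identical split sets, 1 when no
    split is shared.
    """
    a, b = set(splits_a), set(splits_b)
    total = len(a) + len(b)
    if total == 0:
        return 0.0  # two unresolved trees are treated as identical
    return len(a ^ b) / total  # splits found in exactly one tree


def distance_matrix(trees):
    """Pairwise distances for a dict mapping a family name to its splits."""
    names = sorted(trees)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = normalized_rf(trees[x], trees[y])
        dist[(x, y)] = dist[(y, x)] = d
    return dist


# Hypothetical gene trees on taxa A-E; each split is the side without E.
trees = {
    "ogf1": {frozenset("AB"), frozenset("ABC")},  # ((A,B),C,(D,E))
    "ogf2": {frozenset("AC"), frozenset("ABC")},  # ((A,C),B,(D,E))
    "ogf3": {frozenset("AB"), frozenset("ABC")},  # same topology as ogf1
}
dist = distance_matrix(trees)
```

Here `ogf1` and `ogf3` are at distance 0 and would be expected to end up in the same flock, while `ogf2` sits at distance 0.5 from both.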

Arabidopsis Genes Essential for Seedling Viability: Isolation of Insertional Mutants and Molecular Cloning

Genetics ◽

10.1093/genetics/159.4.1765 ◽

2001 ◽

Vol 159 (4) ◽

pp. 1765-1778

Author(s):

Gregory J Budziszewski ◽

Sharon Potter Lewis ◽

Lyn Wegrich Glover ◽

Jennifer Reineke ◽

Gary Jones ◽

...

Keyword(s):

Large Scale ◽

Protein Translocation ◽

Gene Families ◽

Mutant Phenotype ◽

Lethal Mutant ◽

A Genome ◽

Genes Encoding ◽

High Level ◽

Mutant Lines ◽

Genome Scale

Abstract We have undertaken a large-scale genetic screen to identify genes with a seedling-lethal mutant phenotype. From screening ~38,000 insertional mutant lines, we identified >500 seedling-lethal mutants, completed cosegregation analysis of the insertion and the lethal phenotype for >200 mutants, molecularly characterized 54 mutants, and provided a detailed description for 22 of them. Most of the seedling-lethal mutants seem to affect chloroplast function because they display altered pigmentation and affect genes encoding proteins predicted to have chloroplast localization. Although a high level of functional redundancy in Arabidopsis might be expected because 65% of genes are members of gene families, we found that 41% of the essential genes found in this study are members of Arabidopsis gene families. In addition, we isolated several interesting classes of mutants and genes. We found three mutants in the recently discovered nonmevalonate isoprenoid biosynthetic pathway and mutants disrupting genes similar to Tic40 and tatC, which are likely to be involved in chloroplast protein translocation. Finally, we directly compared T-DNA and Ac/Ds transposon mutagenesis methods in Arabidopsis on a genome scale. In each population, we found only about one-third of the insertion mutations cosegregated with a mutant phenotype.

Systematic Detection of Large-Scale Multi-Gene Horizontal Transfer in Prokaryotes

Molecular Biology and Evolution ◽

10.1093/molbev/msab043 ◽

2021 ◽

Author(s):

Lina Kloub ◽

Sean Gosselin ◽

Matthew Fullmer ◽

Joerg Graf ◽

J Peter Gogarten ◽

...

Keyword(s):

Gene Transfer ◽

Large Scale ◽

Single Gene ◽

Gene Families ◽

Microbial Evolution ◽

Phylogenetic Distance ◽

Secretion Systems ◽

Type III Secretion Systems ◽

A Genome ◽

Conserved Gene

Abstract Horizontal gene transfer (HGT) is central to prokaryotic evolution. However, little is known about the “scale” of individual HGT events. In this work, we introduce the first computational framework to help answer the following fundamental question: How often does more than one gene get horizontally transferred in a single HGT event? Our method, called HoMer, uses phylogenetic reconciliation to infer single-gene HGT events across a given set of species/strains, employs several techniques to account for inference error and uncertainty, combines that information with gene order information from extant genomes, and uses statistical analysis to identify candidate horizontal multi-gene transfers (HMGTs) in both extant and ancestral species/strains. HoMer is highly scalable and can be easily used to infer HMGTs across hundreds of genomes. We apply HoMer to a genome-scale dataset of over 22000 gene families from 103 Aeromonas genomes and identify a large number of plausible HMGTs of various scales at both small and large phylogenetic distances. Analysis of these HMGTs reveals interesting relationships between gene function, phylogenetic distance, and frequency of multi-gene transfer. Among other insights, we find that (i) the observed relative frequency of HMGT increases as divergence between genomes increases, (ii) HMGTs often have conserved gene functions, and (iii) rare genes are frequently acquired through HMGT. We also analyze in detail HMGTs involving the zonula occludens toxin and type III secretion systems. By enabling the systematic inference of HMGTs on a large scale, HoMer will facilitate a more accurate and more complete understanding of HGT and microbial evolution.
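
The step of combining single-gene transfer calls with gene order can be illustrated with a small sketch. The Python below groups transferred genes that lie next to each other on a genome into candidate multi-gene events. It is not HoMer's implementation and omits its reconciliation, error handling, and statistical analysis; the function name, the `min_genes` and `max_gap` parameters, and the assumption that all genes in `transferred` share the same donor and recipient are hypothetical.

```python
def candidate_multigene_transfers(gene_order, transferred, min_genes=2, max_gap=0):
    """Group transferred genes that are neighbors in the gene order.

    gene_order  : list of gene identifiers in chromosomal order
    transferred : set of genes inferred as single-gene transfers
                  (assumed here to share one donor and one recipient)
    min_genes   : smallest run reported as a candidate
    max_gap     : number of non-transferred genes tolerated between
                  two consecutive members of a run
    """
    runs, current, last_pos = [], [], None
    for pos, gene in enumerate(gene_order):
        if gene not in transferred:
            continue
        # Close the current run when the gap to the previous hit is too large.
        if last_pos is not None and pos - last_pos - 1 > max_gap:
            if len(current) >= min_genes:
                runs.append(current)
            current = []
        current.append(gene)
        last_pos = pos
    if len(current) >= min_genes:
        runs.append(current)
    return runs


gene_order = ["g1", "g2", "g3", "g4", "g5", "g6", "g7"]
transferred = {"g2", "g3", "g4", "g7"}
strict = candidate_multigene_transfers(gene_order, transferred)
relaxed = candidate_multigene_transfers(gene_order, transferred, max_gap=2)
```

With the default settings only the run g2, g3, g4 is reported; allowing a gap of two genes also pulls g7 into the same candidate.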

Systematic Detection of Large-Scale Multi-Gene Horizontal Transfer in Prokaryotes

10.1101/2020.08.27.270926 ◽

2020 ◽

Author(s):

Lina Kloub ◽

Sean Gosselin ◽

Matthew Fullmer ◽

Joerg Graf ◽

J. Peter Gogarten ◽

...

Keyword(s):

Gene Transfer ◽

Large Scale ◽

Single Gene ◽

Gene Families ◽

Microbial Evolution ◽

Phylogenetic Distance ◽

Secretion Systems ◽

Type III Secretion Systems ◽

A Genome ◽

Conserved Gene

Abstract Horizontal gene transfer (HGT) is central to prokaryotic evolution. However, little is known about the “scale” of individual HGT events. In this work, we introduce the first computational framework to help answer the following fundamental question: How often does more than one gene get horizontally transferred in a single HGT event? Our method, called HoMer, uses phylogenetic reconciliation to infer single-gene HGT events across a given set of species/strains, employs several techniques to account for inference error and uncertainty, combines that information with gene order information from extant genomes, and uses statistical analysis to identify candidate horizontal multi-gene transfers (HMGTs) in both extant and ancestral species/strains. HoMer is highly scalable and can be easily used to infer HMGTs across hundreds of genomes. We apply HoMer to a genome-scale dataset of over 22000 gene families from 103 Aeromonas genomes and identify a large number of plausible HMGTs of various scales at both small and large phylogenetic distances. Analysis of these HMGTs reveals interesting relationships between gene function, phylogenetic distance, and frequency of multi-gene transfer. Among other insights, we find that (i) the relative frequency of HMGT increases as divergence between genomes increases, (ii) HMGTs often have conserved gene functions, and (iii) rare genes are frequently acquired through HMGT. We also analyze in detail HMGTs involving the zonula occludens toxin and type III secretion systems. By enabling the systematic inference of HMGTs on a large scale, HoMer will facilitate a more accurate and more complete understanding of HGT and microbial evolution.

Phylogenomic Reconstruction of the Oomycete Phylogeny Derived from 37 Genomes

mSphere ◽

10.1128/msphere.00095-17 ◽

2017 ◽

Vol 2 (2) ◽

Cited By ~ 32

Author(s):

Charley G. P. McCarthy ◽

David A. Fitzpatrick

Keyword(s):

Large Scale ◽

Plant Pathogens ◽

Single Gene ◽

Genomic Data ◽

Gene Families ◽

Phylogenomic Analysis ◽

Phylogenetic Studies ◽

A Genome ◽

Supertree Methods ◽

Genome Scale

ABSTRACT The oomycetes are a class of microscopic, filamentous eukaryotes within the Stramenopiles-Alveolata-Rhizaria (SAR) supergroup which includes ecologically significant animal and plant pathogens, most infamously the causative agent of potato blight Phytophthora infestans. Single-gene and concatenated phylogenetic studies both of individual oomycete genera and of members of the larger class have resulted in conflicting conclusions concerning species phylogenies within the oomycetes, particularly for the large Phytophthora genus. Genome-scale phylogenetic studies have successfully resolved many eukaryotic relationships by using supertree methods, which combine large numbers of potentially disparate trees to determine evolutionary relationships that cannot be inferred from individual phylogenies alone. With a sufficient amount of genomic data now available, we have undertaken the first whole-genome phylogenetic analysis of the oomycetes using data from 37 oomycete species and 6 SAR species. In our analysis, we used established supertree methods to generate phylogenies from 8,355 homologous oomycete and SAR gene families and have complemented those analyses with both phylogenomic network and concatenated supermatrix analyses. Our results show that a genome-scale approach to oomycete phylogeny resolves oomycete classes and individual clades within the problematic Phytophthora genus. Support for the resolution of the inferred relationships between individual Phytophthora clades varies depending on the methodology used. Our analysis represents an important first step in large-scale phylogenomic analysis of the oomycetes. IMPORTANCE The oomycetes are a class of eukaryotes and include ecologically significant animal and plant pathogens. Single-gene and multigene phylogenetic studies of individual oomycete genera and of members of the larger classes have resulted in conflicting conclusions concerning interspecies relationships among these species, particularly for the Phytophthora genus. The onset of next-generation sequencing techniques now means that a wealth of oomycete genomic data is available. For the first time, we have used genome-scale phylogenetic methods to resolve oomycete phylogenetic relationships. We used supertree methods to generate single-gene and multigene species phylogenies. Overall, our supertree analyses utilized phylogenetic data from 8,355 oomycete gene families. We have also complemented our analyses with superalignment phylogenies derived from 131 single-copy ubiquitous gene families. Our results show that a genome-scale approach to oomycete phylogeny resolves oomycete classes and clades. Our analysis represents an important first step in large-scale phylogenomic analysis of the oomycetes.
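
The superalignment (supermatrix) step mentioned above amounts to concatenating per-gene alignments into one long sequence per species. A minimal Python sketch of that idea follows; it is not the authors' pipeline, and the function name, the dictionary-based alignment format, and the choice to pad absent genes with gap characters are assumptions made for illustration.

```python
def concatenate_alignments(alignments, taxa, missing="-"):
    """Build a supermatrix from per-gene alignments.

    alignments : list of dicts mapping taxon name to an aligned sequence
    taxa       : taxa to include, in output order
    missing    : character used to pad a gene that a taxon lacks
    """
    supermatrix = {taxon: [] for taxon in taxa}
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences in one alignment must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # A taxon without this gene receives a block of gap characters.
            supermatrix[taxon].append(aln.get(taxon, missing * width))
    return {taxon: "".join(parts) for taxon, parts in supermatrix.items()}


# Hypothetical toy alignments for three species.
gene1 = {"sp1": "ATG", "sp2": "ATA", "sp3": "ACG"}
gene2 = {"sp1": "GG", "sp3": "GC"}  # sp2 lacks this gene
matrix = concatenate_alignments([gene1, gene2], ["sp1", "sp2", "sp3"])
```

For single-copy gene families present in every species, as in the 131 families described in the IMPORTANCE statement, no padding would be needed; the padding only matters when some families are absent from some genomes.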

VESPA: Very large-scale Evolutionary and Selective Pressure Analyses

PeerJ Computer Science ◽

10.7717/peerj-cs.118 ◽

2017 ◽

Vol 3 ◽

pp. e118 ◽

Cited By ~ 10

Author(s):

Andrew E. Webb ◽

Thomas A. Walsh ◽

Mary J. O’Connell

Keyword(s):

Phylogenetic Trees ◽

Large Scale ◽

Selective Pressure ◽

Gene Families ◽

Pressure Variation ◽

Phylogeny Reconstruction ◽

Protein Coding ◽

Coding Sequences ◽

A Genome ◽

Pressure Analysis

Background Large-scale molecular evolutionary analyses of protein coding sequences require a number of preparatory inter-related steps, from finding gene families to generating alignments and phylogenetic trees and assessing selective pressure variation. Each phase of these analyses can represent significant challenges, particularly when working with entire proteomes (all protein coding sequences in a genome) from a large number of species. Methods We present VESPA, software capable of automating a selective pressure analysis using codeML in addition to the preparatory analyses and summary statistics. VESPA is written in Python and Perl and is designed to run within a UNIX environment. Results We have benchmarked VESPA and our results show that the method is consistent, performs well on both large scale and smaller scale datasets, and produces results in line with previously published datasets. Discussion Large-scale gene family identification, sequence alignment, and phylogeny reconstruction are all important aspects of large-scale molecular evolutionary analyses. VESPA provides flexible software for simplifying these processes along with downstream selective pressure variation analyses. The software automatically interprets results from codeML and produces simplified summary files to assist the user in better understanding the results. VESPA may be found at the following website: http://www.mol-evol.org/VESPA.

Genetic variation in recombination rate in the pig

Genetics Selection Evolution ◽

10.1186/s12711-021-00643-0 ◽

2021 ◽

Vol 53 (1) ◽

Author(s):

Martin Johnsson ◽

Andrew Whalen ◽

Roger Ros-Freixedes ◽

Gregor Gorjanc ◽

Ching-Yi Chen ◽

...

Keyword(s):

Genetic Variation ◽

Recombination Rate ◽

Large Scale ◽

Genome Wide Association Study ◽

Genetic Material ◽

A Genome ◽

Pig Genome ◽

Genetic Length ◽

Trait Locus ◽

Genomic Regions

Abstract Background Meiotic recombination results in the exchange of genetic material between homologous chromosomes. Recombination rate varies between different parts of the genome, between individuals, and is influenced by genetics. In this paper, we assessed the genetic variation in recombination rate along the genome and between individuals in the pig using multilocus iterative peeling on 150,000 individuals across nine genotyped pedigrees. We used these data to estimate the heritability of recombination and perform a genome-wide association study of recombination in the pig. Results Our results confirmed known features of the recombination landscape of the pig genome, including differences in genetic length of chromosomes and marked sex differences. The recombination landscape was repeatable between lines, but at the same time, there were differences in average autosome-wide recombination rate between lines. The heritability of autosome-wide recombination rate was low but not zero (on average 0.07 for females and 0.05 for males). We found six genomic regions that are associated with recombination rate, among which five harbour known candidate genes involved in recombination: RNF212, SHOC1, SYCP2, MSH4 and HFM1. Conclusions Our results on the variation in recombination rate in the pig genome agree with those reported for other vertebrates, with a low but nonzero heritability, and the identification of a major quantitative trait locus for recombination rate that is homologous to that detected in several other species. This work also highlights the utility of using large-scale livestock data to understand biological processes.

Kernel weighted least square approach for imputing missing values of metabolomics data

Scientific Reports ◽

10.1038/s41598-021-90654-0 ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Nishith Kumar ◽

Md. Aminul Hoque ◽

Masahiro Sugimoto

Keyword(s):

Missing Data ◽

Large Scale ◽

Missing Values ◽

Kernel Weight ◽

Least Square ◽

Data Matrix ◽

Data Imputation ◽

Metabolomics Data ◽

Missing Value ◽

Missing Data Imputation

Abstract Mass spectrometry is a modern and sophisticated high-throughput analytical technique that enables large-scale metabolomic analyses. It yields a high-dimensional large-scale matrix (samples × metabolites) of quantified data that often contains missing cells as well as outliers that originate for several reasons, including technical and biological sources. Although several missing data imputation techniques are described in the literature, the conventional techniques address only the missing value problem and do not handle outliers, so outliers in the dataset decrease the accuracy of the imputation. We developed a new kernel weight function-based missing data imputation technique that resolves the problems of both missing values and outliers. We evaluated the performance of the proposed method and of other conventional and recently developed missing data imputation techniques using both artificially generated data and experimentally measured data, in both the absence and presence of different rates of outliers. Performances based on both artificial data and real metabolomics data indicate the superiority of our proposed kernel weight-based missing data imputation technique to the existing alternatives. For user convenience, an R package of the proposed kernel weight-based missing value imputation technique was developed, which is available at https://github.com/NishithPaul/tWLSA.
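
The general idea of combining kernel weights with least squares so that outliers have little influence on an imputed value can be sketched as follows. This Python example is a generic illustration and not the authors' tWLSA algorithm: the Gaussian kernel on robust (median and MAD) scaled deviations, the single-predictor regression, the `bandwidth` parameter, and all function names are assumptions.

```python
import math
from statistics import median


def kernel_weights(values, bandwidth=3.0):
    """Gaussian kernel weights that shrink toward 0 for values far from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1.0  # avoid a zero scale
    scale = 1.4826 * mad  # MAD rescaled to match a normal standard deviation
    return [math.exp(-0.5 * ((v - med) / (bandwidth * scale)) ** 2) for v in values]


def weighted_linear_fit(x, y, w):
    """Weighted least squares for y = slope * x + intercept."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def impute(x_obs, y_obs, x_new, bandwidth=3.0):
    """Predict the missing y at x_new from observed (x, y) pairs."""
    w = kernel_weights(y_obs, bandwidth)
    slope, intercept = weighted_linear_fit(x_obs, y_obs, w)
    return slope * x_new + intercept


# Hypothetical data: y = 2x + 1, except for one outlier at x = 3.
x = [1, 2, 3, 4, 5, 6]
y = [3, 5, 100, 9, 11, 13]
robust = impute(x, y, 7)  # kernel-weighted estimate for x = 7
slope, intercept = weighted_linear_fit(x, y, [1.0] * len(x))
naive = slope * 7 + intercept  # ordinary least squares for comparison
```

On this toy data the kernel-weighted estimate stays close to the value 15 implied by the clean points, whereas the unweighted fit is pulled well away from it by the single outlier.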

De Novo Genome Assembly of Limpet Bathyacmaea lactea (Gastropoda: Pectinodontidae): The First Reference Genome of a Deep-Sea Gastropod Endemic to Cold Seeps

Genome Biology and Evolution ◽

10.1093/gbe/evaa100 ◽

2020 ◽

Vol 12 (6) ◽

pp. 905-910 ◽

Cited By ~ 2

Author(s):

Ruoyu Liu ◽

Kun Wang ◽

Jun Liu ◽

Wenjie Xu ◽

Yang Zhou ◽

...

Keyword(s):

Deep Sea ◽

Metal Ion ◽

De Novo ◽

Demographic History ◽

Gene Families ◽

Phylogenetic Position ◽

Cold Seeps ◽

Nitrogen And Phosphorus ◽

De Novo Genome Assembly ◽

A Genome

Abstract Cold seeps, characterized by methane, hydrogen sulfide, and other hydrocarbon chemicals, foster one of the most widespread chemosynthetic ecosystems in the deep sea, densely populated by specialized benthos. However, scarce genomic resources severely limit our knowledge about the origin and adaptation of life in this unique ecosystem. Here, we present a genome of a deep-sea limpet Bathyacmaea lactea, a common species associated with the dominant mussel beds in cold seeps. We generated 54.6 gigabases (Gb) of Nanopore reads and 77.9 Gb of BGI-seq raw reads. Assembly yielded a 754.3-Mb genome for B. lactea, with 3,720 contigs and a contig N50 of 1.57 Mb, covering 94.3% of metazoan Benchmarking Universal Single-Copy Orthologs. In total, 23,574 protein-coding genes and 463.4 Mb of repetitive elements were identified. We analyzed the phylogenetic position, substitution rate, demographic history, and TE activity of B. lactea. We also identified 80 expanded gene families and 87 rapidly evolving Gene Ontology categories in the B. lactea genome. Many of these genes were associated with heterocyclic compound metabolism, membrane-bounded organelle, metal ion binding, and nitrogen and phosphorus metabolism. The high-quality assembly and in-depth characterization suggest the B. lactea genome will serve as an essential resource for understanding the origin and adaptation of life in the cold seeps.
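
For readers unfamiliar with the contig N50 statistic quoted above, it is the length of the shortest contig in the set of longest contigs that together cover at least half of the assembly. A minimal Python sketch of the calculation follows; the function name and the toy contig lengths are hypothetical and unrelated to the B. lactea assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths."""
    lengths = sorted(contig_lengths, reverse=True)
    half = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig that pushes the running total to half the
        # assembly size defines the N50.
        if running >= half:
            return length
    raise ValueError("no contigs given")


# Toy example: total length 290, so half is 145; 80 + 70 = 150 reaches it.
toy_n50 = n50([30, 80, 20, 70, 50, 40])
```

In this toy case the N50 is 70, because the two longest contigs already account for more than half of the total length.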

Imaging and Cognitive Genetics: The Norwegian Cognitive NeuroGenetics Sample

Twin Research and Human Genetics ◽

10.1017/thg.2012.8 ◽

2012 ◽

Vol 15 (3) ◽

pp. 442-452 ◽

Cited By ~ 28

Author(s):

Thomas Espeseth ◽

Andrea Christoforou ◽

Astri J. Lundervold ◽

Vidar M. Steen ◽

Stephanie Le Hellard ◽

...

Keyword(s):

Sample Size ◽

Cognitive Aging ◽

Brain Function ◽

Large Scale ◽

Adult Life ◽

Aging Brain ◽

Adult Life Span ◽

Cross Sectional ◽

A Genome ◽

Effects Of Aging

Data collection for the Norwegian Cognitive NeuroGenetics sample (NCNG) was initiated in 2003 with a research grant (to Ivar Reinvang) to study cognitive aging, brain function, and genetic risk factors. The original focus was on the effects of aging (from middle age and up) and candidate genes (e.g., APOE, CHRNA4) in cross-sectional and longitudinal designs, with the cognitive and MRI-based data primarily being used for this purpose. However, as the main topic of the project broadened from cognitive aging to imaging and cognitive genetics more generally, the sample size, age range of the participants, and scope of available phenotypes and genotypes, have developed beyond the initial project. In 2009, a genome-wide association (GWA) study was undertaken, and the NCNG proper was established to study the genetics of cognitive and brain function more comprehensively. The NCNG is now controlled by the NCNG Study Group, which consists of the present authors. Prominent features of the NCNG are the adult life-span coverage of healthy participants with high-dimensional imaging, and cognitive data from a genetically homogenous sample. Another unique property is the large-scale (sample size 300–700) use of experimental cognitive tasks focusing on attention and working memory. The NCNG data is now used in numerous ongoing GWA-based studies and has contributed to several international consortia on imaging and cognitive genetics. The objective of the following presentation is to give other researchers the information necessary to evaluate possible contributions from the NCNG to various multi-sample data analyses.

An International Campaign for Agricultural and Livestock Genomics (CALG)

Asia-Pacific Biotech News ◽

10.1142/s0219030302001970 ◽

2002 ◽

Vol 06 (24) ◽

pp. 958-965

Author(s):

Jun Yu ◽

Jian Wang ◽

Huanming Yang

Keyword(s):

Large Scale ◽

Cost Effective ◽

Model Organisms ◽

Environmental Biology ◽

cDNA Sequences ◽

Governmental Agencies ◽

Technology Innovations ◽

A Genome ◽

Starting Point ◽

The Cost

The time has come for a coordinated international effort to sequence agricultural and livestock genomes. While the human genome and the genomes of many model organisms (related to human health and basic biological interests) have been sequenced or entered into sequencing pipelines, agronomically important crop and livestock genomes have not been given high enough priority. Although we face many challenges in policy-making, grant funding, regional task emphasis, research community consensus and technology innovation, many initiatives are being announced and formulated based on the cost-effective, large-scale sequencing procedure known as whole genome shotgun (WGS) sequencing, which produces draft sequences covering 95 to 99 percent of a genome. Genes identified from such draft sequences, coupled with other resources such as molecular markers, large-insert clones and cDNA sequences, provide ample information and tools to further our knowledge of agricultural and environmental biology in a genome era that is just entering its accelerated period. If the campaign succeeds, molecular biologists, geneticists and field biologists from all countries, rich or poor, will be brought to the same starting point and can expect another astronomical increase in basic genomic information, ready to be converted effectively into knowledge that will ultimately change our lives and environment for the better. We call upon national and international governmental agencies and organizations as well as research foundations to support this unprecedented movement.
